arXiv 



Natural Language Processing (almost) from Scratch 



Ronan Collobert

NEC Labs America, Princeton NJ. 

Jason Weston 

Google, New York, NY. 

Leon Bottou 

Michael Karlen 

Koray Kavukcuoglu

Pavel Kuksa

NEC Labs America, Princeton NJ. 



RONAN@COLLOBERT.COM

JWESTON@GOOGLE.COM

LEON@BOTTOU.ORG
MICHAEL.KARLEN@GMAIL.COM
KORAY@CS.NYU.EDU
PKUKSA@CS.RUTGERS.EDU




Abstract

We propose a unified neural network architecture and learning algorithm that can be applied
to various natural language processing tasks including: part-of-speech tagging, chunking,
named entity recognition, and semantic role labeling. This versatility is achieved by trying
to avoid task-specific engineering and therefore disregarding a lot of prior knowledge.
Instead of exploiting man-made input features carefully optimized for each task, our system
learns internal representations on the basis of vast amounts of mostly unlabeled training
data. This work is then used as a basis for building a freely available tagging system with
good performance and minimal computational requirements.

Keywords: Natural Language Processing, Neural Networks

1. Introduction



Will a computer program ever be able to convert a piece of English text into a data structure 
that unambiguously and completely describes the meaning of the natural language text? 
Among numerous problems, no consensus has emerged about the form of such a data 
structure. Until such fundamental Artificial Intelligence problems are resolved, computer 
scientists must settle for reduced objectives: extracting simpler representations describing 
restricted aspects of the textual information. 

These simpler representations are often motivated by specific applications, for instance, 
bag-of-words variants for information retrieval. These representations can also be motivated 
by our belief that they capture something more general about natural language. They 
can describe syntactic information (e.g. part-of-speech tagging, chunking, and parsing) or
semantic information (e.g. word-sense disambiguation, semantic role labeling, named entity 
extraction, and anaphora resolution). Text corpora have been manually annotated with such 
data structures in order to compare the performance of various systems. The availability of 
standard benchmarks has stimulated research in Natural Language Processing (NLP) and 
effective systems have been designed for all these tasks. Such systems are often viewed as
software components for constructing real-world NLP solutions. 

The overwhelming majority of these state-of-the-art systems address a benchmark 
task by applying linear statistical models to ad-hoc features. In other words, the 
researchers themselves discover intermediate representations by engineering task-specific 
features. These features are often derived from the output of preexisting systems, leading 
to complex runtime dependencies. This approach is effective because researchers leverage 
a large body of linguistic knowledge. On the other hand, there is a great temptation to 
optimize the performance of a system for a specific benchmark. Although such performance 
improvements can be very useful in practice, they teach us little about the means to progress 
toward the broader goals of natural language understanding and the elusive goals of Artificial 
Intelligence. 

In this contribution, we try to excel on multiple benchmarks while avoiding task-specific 
engineering. Instead we use a single learning system able to discover adequate internal
representations. In fact we view the benchmarks as indirect measurements of the relevance 
of the internal representations discovered by the learning procedure, and we posit that these 
intermediate representations are more general than any of the benchmarks. Our desire to 
avoid task-specific engineered features led us to ignore a large body of linguistic knowledge. 
Instead we reach good performance levels in most of the tasks by transferring intermediate 
representations discovered on large unlabeled datasets. We call this approach "almost from 
scratch" to emphasize the reduced (but still important) reliance on a priori NLP knowledge. 

The paper is organized as follows. Section 2 describes the benchmark tasks of
interest. Section 3 describes the unified model and reports benchmark results obtained with
supervised training. Section 4 leverages large unlabeled datasets (~852 million words)
to train the model on a language modeling task. Performance improvements are then
demonstrated by transferring the unsupervised internal representations into the supervised
benchmark models. Section 5 investigates multitask supervised training. Section 6 then
evaluates how much further improvement can be achieved by incorporating standard NLP 
task-specific engineering into our systems. Drifting away from our initial goals gives us the 
opportunity to construct an all-purpose tagger that is simultaneously accurate, practical, 
and fast. We then conclude with a short discussion section. 

2. The Benchmark Tasks 

In this section, we briefly introduce four standard NLP tasks on which we will benchmark 
our architectures within this paper: Part-Of-Speech tagging (POS), chunking (CHUNK), 
Named Entity Recognition (NER) and Semantic Role Labeling (SRL). For each of them, 
we consider a standard experimental setup and give an overview of state-of-the-art systems 
on this setup. The experimental setups are summarized in Table 1 while state-of-the-art
systems are reported in Table 2.

2.1 Part-Of-Speech Tagging 

POS aims at labeling each word with a unique tag that indicates its syntactic role, e.g.
plural noun, adverb, ... A standard benchmark setup is described in detail by Toutanova
et al. (2003). Sections 0-18 of Wall Street Journal (WSJ) data are used for training, while
sections 19-21 are for validation and sections 22-24 for testing.

Table 1: Experimental setup: for each task, we report the standard benchmark we used,
the dataset it relates to, as well as training and test information.

(a) POS

System                        Accuracy
Shen et al. (2007)            97.33%
Toutanova et al. (2003)       97.24%
Gimenez and Marquez (2004)    97.16%

(b) CHUNK

System                        F1
Shen and Sarkar (2005)        95.23%
Sha and Pereira (2003)        94.29%
Kudo and Matsumoto (2001)     93.91%

(c) NER

System                        F1
Ando and Zhang (2005)         89.31%
Florian et al. (2003)         88.76%
Kudo and Matsumoto (2001)     88.31%

(d) SRL

System                        F1
Koomen et al. (2005)          77.92%
Pradhan et al. (2005)         77.30%
Haghighi et al. (2005)        77.04%

Table 2: State-of-the-art systems on four NLP tasks. Performance is reported in per-word
accuracy for POS, and F1 score for CHUNK, NER and SRL. Systems in bold will be referred
to as benchmark systems in the rest of the paper (see text).

The best POS classifiers are based on classifiers trained on windows of text, which are
then fed to a bidirectional decoding algorithm during inference. Features include preceding
and following tag context as well as multiple words (bigrams, trigrams, ...) context, and
handcrafted features to deal with unknown words. Toutanova et al. (2003), who use
maximum entropy classifiers and a bidirectional dependency network (Heckerman et al.,
2001) at inference, reach 97.24% per-word accuracy. Gimenez and Marquez (2004) proposed
an SVM approach also trained on text windows, with bidirectional inference achieved with
two Viterbi decoders (left-to-right and right-to-left). They obtained 97.16% per-word
accuracy. More recently, Shen et al. (2007) pushed the state-of-the-art up to 97.33%,
with a new learning algorithm they call guided learning, also for bidirectional sequence
classification.

2.2 Chunking

Also called shallow parsing, chunking aims at labeling segments of a sentence with syntactic 
constituents such as noun or verb phrases (NP or VP). Each word is assigned only one unique 
tag, often encoded as a begin-chunk (e.g. B-NP) or inside-chunk tag (e.g. I-NP). Chunking 
is often evaluated using the CoNLL 2000 shared task (see footnote 1). Sections 15-18 of WSJ data are
used for training and section 20 for testing. Validation is achieved by splitting the training 
set. 



Kudoh and Matsumoto (2000) won the CoNLL 2000 challenge on chunking with an F1-score
of 93.48%. Their system was based on Support Vector Machines (SVMs). Each
SVM was trained in a pairwise classification manner, and fed with a window around the
word of interest containing POS and words as features, as well as surrounding tags. They
perform dynamic programming at test time. Later, they improved their results up to
93.91% (Kudo and Matsumoto, 2001) using an ensemble of classifiers trained with different
tagging conventions (see Section 3.2.3).

Since then, a certain number of systems based on second-order random fields were
reported (Sha and Pereira, 2003; McDonald et al., 2005; Sun et al., 2008), all reporting
around 94.3% F1 score. These systems use features composed of words, POS tags, and
tags.



More recently, Shen and Sarkar (2005) obtained 95.23% using a voting classifier scheme,
where each classifier is trained on different tag representations (IOB, IOE, ...; see
footnote 2). They use
POS features coming from an external tagger, as well as carefully hand-crafted specialization
features which again change the data representation by concatenating some (carefully
chosen) chunk tags or some words with their POS representation. They then build trigrams
over these features, which are finally passed through a Viterbi decoder at test time.



2.3 Named Entity Recognition 

NER labels atomic elements in the sentence into categories such as "PERSON" or
"LOCATION". As in the chunking task, each word is assigned a tag prefixed by an indicator
of the beginning or the inside of an entity. The CoNLL 2003 setup (see footnote 3) is a NER
benchmark dataset based on Reuters data. The contest provides training, validation and testing sets.



Florian et al. (2003) presented the best system at the NER CoNLL 2003 challenge, with
88.76% F1 score. They used a combination of various machine-learning classifiers. Features
they picked included words, POS tags, CHUNK tags, prefixes and suffixes, a large gazetteer
(not provided by the challenge), as well as the output of two other NER classifiers trained
on richer datasets. Chieu (2003), the second best performer of CoNLL 2003 (88.31% F1),
also used an external gazetteer (their performance goes down to 86.84% with no gazetteer)
and several hand-chosen features.



Later, Ando and Zhang (2005) reached 89.31% F1 with a semi-supervised approach.
They trained jointly a linear model on NER with a linear model on two auxiliary
unsupervised tasks. They also performed Viterbi decoding at test time. The unlabeled
corpus was 27M words taken from Reuters. Features included words, POS tags, suffixes
and prefixes or CHUNK tags, but overall were less specialized than those of the CoNLL 2003
challengers.

1. See http://www.cnts.ua.ac.be/conll2000/chunking

2. See Table 3 for tagging scheme details.

3. See http://www.cnts.ua.ac.be/conll2003/ner



2.4 Semantic Role Labeling 

SRL aims at giving a semantic role to a syntactic constituent of a sentence. In the
PropBank (Palmer et al., 2005) formalism one assigns roles ARG0-5 to words that are
arguments of a verb (or more technically, a predicate) in the sentence, e.g. the following
sentence might be tagged "[John]ARG0 [ate]REL [the apple]ARG1", where "ate" is the
predicate. The precise arguments depend on a verb's frame and if there are multiple verbs
in a sentence some words might have multiple tags. In addition to the ARG0-5 tags,
there are several modifier tags such as ARGM-LOC (locational) and ARGM-TMP
(temporal) that operate in a similar way for all verbs. We picked CoNLL 2005 (see
footnote 4) as our SRL
benchmark. It takes sections 2-21 of WSJ data as training set, and section 24 as validation
set. A test set composed of section 23 of WSJ concatenated with 3 sections from the Brown
corpus is also provided by the challenge.

State-of-the-art SRL systems consist of several stages: producing a parse tree, identifying 
which parse tree nodes represent the arguments of a given verb, and finally classifying these 
nodes to compute the corresponding SRL tags. This entails extracting numerous base 
features from the parse tree and feeding them into statistical models. Feature categories
commonly used by these systems include (Gildea and Jurafsky, 2002; Pradhan et al., 2004):



the parts of speech and syntactic labels of words and nodes in the tree; 

the node's position (left or right) in relation to the verb; 

the syntactic path to the verb in the parse tree; 

whether a node in the parse tree is part of a noun or verb phrase; 

the voice of the sentence: active or passive; 

the node's head word; and 

the verb sub-categorization. 



Pradhan et al. (2004) take these base features and define additional features, notably
the part-of-speech tag of the head word, the predicted named entity class of the argument,
features providing word sense disambiguation for the verb (they add 25 variants of 12 new
feature types overall). This system is close to the state-of-the-art in performance. Pradhan
et al. (2005) obtain 77.30% F1 with a system based on SVM classifiers and simultaneously
using the two parse trees provided for the SRL task. In the same spirit, Haghighi et al.
(2005) use log-linear models on each tree node, re-ranked globally with a dynamic algorithm.
Their system reaches 77.04% using the five top Charniak parse trees.



Koomen et al. (2005) hold the state-of-the-art with Winnow-like (Littlestone, 1988)
classifiers, followed by a decoding stage based on an integer program that enforces specific
constraints on SRL tags. They reach 77.92% F1 on CoNLL 2005, thanks to the five top
parse trees produced by the Charniak (2000) parser (only the first one was provided by the
contest) as well as the Collins (1999) parse tree.



4. See http://www.lsi.upc.edu/~srlconll 

2.5 Evaluation



In our experiments, we strictly followed the standard evaluation procedure of each CoNLL
challenge for NER, CHUNK and SRL. All three tasks are evaluated by computing the
F1 scores over chunks produced by our models. The POS task is evaluated by computing
the per-word accuracy, as is the case for the standard benchmark we refer to (Toutanova
et al., 2003). We picked the conlleval script (see footnote 5) for evaluating POS (see
footnote 6), NER and CHUNK.
For SRL, we used the srl-eval.pl script included in the srlconll package (see footnote 7).
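
To make the chunk-level scoring concrete, here is a minimal sketch of an F1 computation over chunks. It assumes tags in an IOB format where every chunk starts with an explicit "B-" prefix. It is not the conlleval script itself, and the function names `extract_chunks` and `chunk_f1` are illustrative only.

```python
def extract_chunks(tags):
    """Return the set of (label, start, end) chunks encoded by IOB tags."""
    chunks, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        prefix, _, name = tag.partition("-")
        # An open chunk ends on "O", on a new "B-", or on an "I-" with another label.
        if start is not None and (prefix in ("O", "B") or name != label):
            chunks.add((label, start, i))
            start, label = None, None
        # Open a chunk on "B-"; a stray "I-" with no open chunk also opens one.
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, name
    return chunks


def chunk_f1(gold_tags, predicted_tags):
    """Precision, recall and F1 over exactly matching chunks."""
    gold = extract_chunks(gold_tags)
    predicted = extract_chunks(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```

A chunk counts as correct only when both its boundaries and its label match, which is why a system can have high per-word accuracy and still a noticeably lower F1.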



2.6 Discussion 

When participating in an (open) challenge, it is legitimate to increase generalization by all
means. It is thus not surprising to see many top CoNLL systems using external labeled data,
like additional NER classifiers for the NER architecture of Florian et al. (2003) or additional
parse trees for SRL systems (Koomen et al., 2005). Combining multiple systems or carefully
tweaking features is also a common approach, like in the chunking top system (Shen and
Sarkar, 2005).



However, when comparing systems, we do not learn anything of the quality of each
system if they were trained with different labeled data. For that reason, we will refer to
benchmark systems, that is, top existing systems which avoid usage of external data and
have been well-established in the NLP field: (Toutanova et al., 2003) for POS and (Sha and
Pereira, 2003) for chunking. For NER we consider (Ando and Zhang, 2005) as they were
using additional unlabeled data only. We picked (Koomen et al., 2005) for SRL, keeping in
mind they use 4 additional parse trees not provided by the challenge. These benchmark
systems will serve as baseline references in our experiments. We marked them in bold
in Table 2.

We note that, across the four tasks considered in this work, the best systems for the
more complex tasks (with correspondingly lower accuracies) rely on more engineered
features than the best systems for the simpler tasks. That is, the POS
task is one of the simplest of our four tasks, and only has relatively few engineered features, 
whereas SRL is the most complex, and many kinds of features have been designed for it. 
This clearly has implications for as yet unsolved NLP tasks requiring more sophisticated 
semantic understanding than the ones considered here. 



3. The Networks 

All the NLP tasks above can be seen as tasks assigning labels to words. The traditional NLP
approach is: extract from the sentence a rich set of hand-designed features which are then
fed to a standard classification algorithm, e.g. a Support Vector Machine (SVM), often with
a linear kernel. The choice of features is a completely empirical process, mainly based first
on linguistic intuition, and then trial and error, and the feature selection is task dependent,
implying additional research for each new NLP task. Complex tasks like SRL then require
a large number of possibly complex features (e.g., extracted from a parse tree) which can
impact the computational cost which might be important for large-scale applications or
applications requiring real-time response.

5. Available at http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt

6. We used the "-r" option of the conlleval script to get the per-word accuracy, for POS only.

7. Available at http://www.lsi.upc.es/~srlconll/srlconll-1.1.tgz

Figure 1: Window approach network.

Instead, we advocate a radically different approach: as input we will try to pre-process 
our features as little as possible and then use a multilayer neural network (NN) architecture, 
trained in an end-to-end fashion. The architecture takes the input sentence and learns 
several layers of feature extraction that process the inputs. The features computed by the 
deep layers of the network are automatically trained by backpropagation to be relevant to 
the task. We describe in this section a general multilayer architecture suitable for all our 
NLP tasks, which is generalizable to other NLP tasks as well. 

Our architecture is summarized in Figure 1 and Figure 2. The first layer extracts features
for each word. The second layer extracts features from a window of words or from the whole 
sentence, treating it as a sequence with local and global structure (i.e., it is not treated like 
a bag of words). The following layers are standard NN layers. 

Notations  We consider a neural network $f_\theta(\cdot)$, with parameters $\theta$. Any feed-forward
neural network with $L$ layers can be seen as a composition of functions $f_\theta^l(\cdot)$, corresponding
to each layer $l$:

$$f_\theta(\cdot) = f_\theta^L(f_\theta^{L-1}(\ldots f_\theta^1(\cdot) \ldots)).$$

In the following, we will describe each layer we use in our networks shown in Figure 1
and Figure 2. We adopt a few notations. Given a matrix $A$ we denote $[A]_{i,j}$ the coefficient
at row $i$ and column $j$ in the matrix. We also denote $\langle A \rangle_i^{d_{win}}$ the vector obtained by
concatenating the $d_{win}$ column vectors around the $i^{th}$ column vector of matrix $A \in \mathbb{R}^{d_1 \times d_2}$:

$$\left[ \langle A \rangle_i^{d_{win}} \right]^T = \left( [A]_{1,\,i-d_{win}/2} \ldots [A]_{d_1,\,i-d_{win}/2},\ \ldots,\ [A]_{1,\,i+d_{win}/2} \ldots [A]_{d_1,\,i+d_{win}/2} \right).$$

As a special case, $\langle A \rangle_i^1$ represents the $i^{th}$ column of matrix $A$. For a vector $v$, we denote
$[v]_i$ the scalar at index $i$ in the vector. Finally, a sequence of elements $\{x_1, x_2, \ldots, x_T\}$ is
written $[x]_1^T$. The $i^{th}$ element of the sequence is $[x]_i$.

Figure 2: Sentence approach network.

3.1 Transforming Words into Feature Vectors

One of the key points of our architecture is its ability to perform well with the
use of (almost) raw words (see footnote 8). The ability of our method to learn good word
representations
is thus crucial to our approach. For efficiency, words are fed to our architecture as indices
taken from a finite dictionary $\mathcal{D}$. Obviously, a simple index does not carry much useful
information about the word. However, the first layer of our network maps each of these 
word indices into a feature vector, by a lookup table operation. Given a task of interest, a 
relevant representation of each word is then given by the corresponding lookup table feature 
vector, which is trained by backpropagation. 

More formally, for each word $w \in \mathcal{D}$, an internal $d_{wrd}$-dimensional feature vector
representation is given by the lookup table layer $LT_W(\cdot)$:

$$LT_W(w) = \langle W \rangle_w^1,$$

where $W \in \mathbb{R}^{d_{wrd} \times |\mathcal{D}|}$ is a matrix of parameters to be learnt, $\langle W \rangle_w^1 \in \mathbb{R}^{d_{wrd}}$ is the $w^{th}$
column of $W$ and $d_{wrd}$ is the word vector size (a hyper-parameter to be chosen by the user).
Given a sentence or any sequence of $T$ words $[w]_1^T$ in $\mathcal{D}$, the lookup table layer applies the
same operation for each word in the sequence, producing the following output matrix:

$$LT_W([w]_1^T) = \left( \langle W \rangle_{[w]_1}^1 \ \ \langle W \rangle_{[w]_2}^1 \ \ \ldots \ \ \langle W \rangle_{[w]_T}^1 \right). \quad (1)$$

This matrix can then be fed to further neural network layers, as we will see below. 
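
The lookup table operation can be sketched in a few lines. This is only an illustration of equation (1) under stated assumptions: the toy dictionary, the random initialization range and the helper names (`make_lookup_table`, `lookup`, `lookup_sequence`) are hypothetical and not taken from the paper.

```python
import random


def make_lookup_table(d_wrd, dictionary_size, seed=0):
    """Create W as a d_wrd x |D| matrix (list of rows) with small random entries."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dictionary_size)]
            for _ in range(d_wrd)]


def lookup(W, word_index):
    """LT_W(w): return the column of W selected by the word index."""
    return [row[word_index] for row in W]


def lookup_sequence(W, word_indices):
    """Equation (1): a d_wrd x T matrix whose t-th column is the vector of word t."""
    return [[row[w] for w in word_indices] for row in W]


# Hypothetical toy dictionary; a real system would build it from a corpus.
dictionary = {"PADDING": 0, "the": 1, "cat": 2, "sat": 3}
W = make_lookup_table(d_wrd=5, dictionary_size=len(dictionary))
sentence = [dictionary[word] for word in ["the", "cat", "sat"]]
features = lookup_sequence(W, sentence)  # 5 rows, 3 columns
```

Because the layer only selects columns, the gradient of the training objective touches only the columns of the words that actually occur in the input.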

3.1.1 Extending to Any Discrete Features 

One might want to provide features other than words if one suspects that these features are 
helpful for the task of interest. For example, for the NER task, one could provide a feature 
which says if a word is in a gazetteer or not. Another common practice is to introduce some 
basic pre-processing, such as word-stemming or dealing with upper and lower case. In this 
latter option, the word would be then represented by three discrete features: its lower case 
stemmed root, its lower case ending, and a capitalization feature. 

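Generally speaking, we can consider a word as represented by $K$ discrete features $w \in
\mathcal{D}^1 \times \cdots \times \mathcal{D}^K$, where $\mathcal{D}^k$ is the dictionary for the $k^{th}$ feature. We associate to each feature a
lookup table $LT_{W^k}(\cdot)$, with parameters $W^k \in \mathbb{R}^{d_{wrd}^k \times |\mathcal{D}^k|}$ where $d_{wrd}^k \in \mathbb{N}$ is a user-specified
vector size. Given a word $w$, a feature vector of dimension $d_{wrd} = \sum_k d_{wrd}^k$ is then obtained
by concatenating all lookup table outputs:

$$LT_{W^1,\ldots,W^K}(w) = \begin{pmatrix} LT_{W^1}(w_1) \\ \vdots \\ LT_{W^K}(w_K) \end{pmatrix} = \begin{pmatrix} \langle W^1 \rangle_{w_1}^1 \\ \vdots \\ \langle W^K \rangle_{w_K}^1 \end{pmatrix}.$$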

8. We did some pre-processing, namely lowercasing and encoding capitalization as another feature. With 
enough (unlabeled) training data, presumably we could learn a model without this processing. Ideally, 
an even more raw input would be to learn from letter sequences rather than words, however we felt that 
this was beyond the scope of this work. 
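
The matrix output of the lookup table layer for a sequence of words $[w]_1^T$ is then similar
to (1), but where extra rows have been added for each discrete feature:

$$LT_{W^1,\ldots,W^K}([w]_1^T) = \begin{pmatrix} \langle W^1 \rangle_{[w_1]_1}^1 & \ldots & \langle W^1 \rangle_{[w_1]_T}^1 \\ \vdots & & \vdots \\ \langle W^K \rangle_{[w_K]_1}^1 & \ldots & \langle W^K \rangle_{[w_K]_T}^1 \end{pmatrix}. \quad (2)$$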



These vector features in the lookup table effectively learn features for words in the dictionary. 
Now, we want to use these trainable features as input to further layers of trainable feature 
extractors, that can represent groups of words and then finally sentences. 
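
As an illustration of equation (2), the sketch below concatenates the outputs of several lookup tables, one per discrete feature. The two toy tables (a word table and a capitalization table) and the function names are assumptions made for the example; their integer entries are chosen only so the result is easy to read.

```python
def lookup(W, index):
    """Return column `index` of the matrix W (stored as a list of rows)."""
    return [row[index] for row in W]


def lookup_features(tables, feature_indices):
    """Concatenate LT_{W^k}(w_k) over the K discrete features of one word."""
    vector = []
    for W_k, w_k in zip(tables, feature_indices):
        vector.extend(lookup(W_k, w_k))
    return vector


def lookup_features_sequence(tables, sequence):
    """Equation (2): column t is the concatenated feature vector of word t."""
    columns = [lookup_features(tables, word) for word in sequence]
    # Transpose the list of columns into a (sum_k d^k_wrd) x T list of rows.
    return [list(row) for row in zip(*columns)]


# Hypothetical tables: 3-dimensional word vectors over 4 words,
# 2-dimensional capitalization vectors over 2 capitalization classes.
W_word = [[10 * r + c for c in range(4)] for r in range(3)]
W_caps = [[100 + 10 * r + c for c in range(2)] for r in range(2)]

# Each word is a tuple (word index, capitalization index).
matrix = lookup_features_sequence([W_word, W_caps], [(2, 1), (0, 0)])
```

Each feature keeps its own table and its own vector size, so adding a feature such as a gazetteer flag only adds rows to the matrix and leaves the rest of the network unchanged.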

3.2 Extracting Higher Level Features from Word Feature Vectors 

Feature vectors produced by the lookup table layer need to be combined in subsequent layers 
of the neural network to produce a tag decision for each word in the sentence. Producing 
tags for each element in variable length sequences (here, a sentence is a sequence of words) 
is a standard problem in machine-learning. We consider two common approaches which tag
one word at a time: a window approach, and a (convolutional) sentence approach.



3.2.1 Window Approach

A window approach assumes the tag of a word depends mainly on its neighboring words.
Given a word to tag, we consider a fixed size $k_{sz}$ (a hyper-parameter) window of words
around this word. Each word in the window is first passed through the lookup table layer (1)
or (2), producing a matrix of word features of fixed size $d_{wrd} \times k_{sz}$. This matrix can be
viewed as a $d_{wrd}\,k_{sz}$-dimensional vector by concatenating each column vector, which can be
fed to further neural network layers. More formally, the word feature window given by the
first network layer can be written as:

$$f_\theta^1 = \langle LT_W([w]_1^T) \rangle_t^{d_{win}} = \begin{pmatrix} \langle W \rangle_{[w]_{t-d_{win}/2}}^1 \\ \vdots \\ \langle W \rangle_{[w]_t}^1 \\ \vdots \\ \langle W \rangle_{[w]_{t+d_{win}/2}}^1 \end{pmatrix}, \quad (3)$$

where $d_{win}$ denotes the window size $k_{sz}$.



Linear Layer  The fixed size vector $f_\theta^1$ can be fed to one or several standard neural
network layers which perform affine transformations over their inputs:

$$f_\theta^l = W^l f_\theta^{l-1} + b^l, \quad (4)$$

where $W^l \in \mathbb{R}^{n_{hu}^l \times n_{hu}^{l-1}}$ and $b^l \in \mathbb{R}^{n_{hu}^l}$ are the parameters to be trained. The hyper-parameter
$n_{hu}^l$ is usually called the number of hidden units of the $l^{th}$ layer.
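
The affine transformation of equation (4) is an ordinary matrix-vector product plus a bias. A minimal sketch in plain Python follows; the function name `linear` and the toy values are illustrative.

```python
def linear(W, b, x):
    """Equation (4): return W x + b for a matrix W (list of rows) and vectors b, x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


# Toy layer with 3 inputs and 2 hidden units.
W = [[1, 0, 2],
     [0, 1, -1]]
b = [0.5, -0.5]
hidden = linear(W, b, [1, 2, 3])
```

The number of rows of `W` is the number of hidden units of the layer, and the number of columns must match the size of the incoming vector.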



HardTanh Layer  Several linear layers are often stacked, interleaved with a non-linearity
function, to extract highly non-linear features. If no non-linearity is introduced, our network
would be a simple linear model. We chose a "hard" version of the hyperbolic tangent as
non-linearity. It has the advantage of being slightly cheaper to compute (compared to the exact
hyperbolic tangent), while leaving the generalization performance unchanged (Collobert,
2004). The corresponding layer $l$ applies a HardTanh over its input vector:

$$\left[ f_\theta^l \right]_i = \mathrm{HardTanh}\left( \left[ f_\theta^{l-1} \right]_i \right),$$

where

$$\mathrm{HardTanh}(x) = \begin{cases} -1 & \text{if } x < -1 \\ x & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1 \end{cases} \quad (5)$$
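
Equation (5) amounts to clipping each coordinate to the interval [-1, 1]. A small sketch, with illustrative function names:

```python
def hardtanh(x):
    """Equation (5): clip a scalar to the interval [-1, 1]."""
    return max(-1.0, min(1.0, x))


def hardtanh_layer(vector):
    """Apply HardTanh independently to every coordinate of the input vector."""
    return [hardtanh(x) for x in vector]


activations = hardtanh_layer([-2.5, -1.0, 0.3, 1.0, 4.0])
```

Only a comparison is needed per coordinate, which is where the small speed advantage over the exact hyperbolic tangent comes from.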

Scoring Finally, the output size of the last layer L of our network is equal to the number 
of possible tags for the task of interest. Each output can be then interpreted as a score of 
the corresponding tag (given the input of the network), thanks to a carefully chosen cost 
function that we will describe later in this section. 

Remark 1 (Border Effects)  The feature window (3) is not well defined for words near
the beginning or the end of a sentence. To circumvent this problem, we augment the sentence
with a special "PADDING" word replicated $d_{win}/2$ times at the beginning and the end. This
is akin to the use of "start" and "stop" symbols in sequence models.
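
The window extraction of equation (3), together with the padding of Remark 1, can be sketched as follows. This assumes an odd window size and a lookup table stored as a list of rows; the reserved index `PADDING = 0` and the helper names are hypothetical.

```python
PADDING = 0  # hypothetical dictionary index reserved for the "PADDING" word


def pad(word_indices, d_win):
    """Add d_win // 2 PADDING indices at both ends of the sentence."""
    half = d_win // 2
    return [PADDING] * half + list(word_indices) + [PADDING] * half


def window_vector(W, word_indices, t, d_win):
    """Equation (3): concatenate the d_win word vectors centred on position t."""
    padded = pad(word_indices, d_win)
    window = padded[t:t + d_win]  # position t of the sentence sits at t + half
    return [row[w] for w in window for row in W]


# Toy lookup table with d_wrd = 2 and four words: column c is [c, 10 + c].
W = [[10 * r + c for c in range(4)] for r in range(2)]
sentence = [1, 2, 3]
first = window_vector(W, sentence, t=0, d_win=3)
last = window_vector(W, sentence, t=2, d_win=3)
```

Every position, including the first and last, then yields a vector of the same size `d_wrd * d_win`, so the same linear layers can be applied everywhere.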

3.2.2 Sentence Approach 

We will see in the experimental section that a window approach performs well for most 
natural language processing tasks we are interested in. However this approach fails with 
SRL, where the tag of a word depends on a verb (or, more correctly, predicate) chosen 
beforehand in the sentence. If the verb falls outside the window, one cannot expect this word 
to be tagged correctly. In this particular case, tagging a word requires the consideration of 
the whole sentence. When using neural networks, the natural choice to tackle this problem
becomes a convolutional approach, first introduced by Waibel et al. (1989) and also called
Time Delay Neural Networks (TDNNs) in the literature.

We describe in detail our convolutional network below. It successively takes the complete
sentence, passes it through the lookup table layer (1), produces local features around each
word of the sentence thanks to convolutional layers, combines these features into a global
feature vector which can then be fed to standard affine layers (4). In the semantic role
labeling case, this operation is performed for each word in the sentence, and for each verb
in the sentence. It is thus necessary to encode in the network architecture which verb we
are considering in the sentence, and which word we want to tag. For that purpose, each
word at position $i$ in the sentence is augmented with two features in the way described
in Section 3.1.1. These features encode the relative distances $i - pos_v$ and $i - pos_w$ with
respect to the chosen verb at position $pos_v$, and the word to tag at position $pos_w$ respectively.
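
The two relative-distance features can be computed as in the sketch below. Since lookup tables need non-negative indices from a finite dictionary, the example clips distances to a maximum magnitude and shifts them; this clipping scheme and the `max_distance` parameter are assumptions of the sketch, not something stated in the text above.

```python
def distance_features(sentence_length, pos_v, pos_w, max_distance=5):
    """Return, for each position i, lookup indices for i - pos_v and i - pos_w."""
    def to_index(distance):
        # Clip to [-max_distance, max_distance], then shift to a 0-based index.
        clipped = max(-max_distance, min(max_distance, distance))
        return clipped + max_distance

    return [(to_index(i - pos_v), to_index(i - pos_w))
            for i in range(sentence_length)]


# Five-word sentence, verb at position 1, word to tag at position 3.
features = distance_features(5, pos_v=1, pos_w=3, max_distance=2)
```

Each of the two indices would then be fed to its own lookup table, exactly like the other discrete features of Section 3.1.1.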

Figure 3: Number of features chosen at each word position by the Max layer. We consider
a sentence approach network (Figure 2) trained for SRL. The number of "local" features
output by the convolution layer is 300 per word. By applying a Max over the sentence,
we obtain 300 features for the whole sentence. It is interesting to see that the network
catches features mostly around the verb of interest (here "report") and word of interest
("proposed" (left) or "often" (right)).

Convolutional Layer  A convolutional layer can be seen as a generalization of a window
approach: given a sequence represented by columns in a matrix $f_\theta^{l-1}$ (in our lookup table
matrix (1)), a matrix-vector operation as in (4) is applied to each successive
window in the sequence. Using previous notations, the $t^{th}$ output column of the $l^{th}$ layer
can be computed as:

$$\langle f_\theta^l \rangle_t^1 = W^l \langle f_\theta^{l-1} \rangle_t^{d_{win}} + b^l \quad \forall t, \quad (6)$$

where the weight matrix $W^l$ is the same across all windows $t$ in the sequence. Convolutional
layers extract local features around each window of the given sequence. As for standard
affine layers (4), convolutional layers are often stacked to extract higher level features.
In this case, each layer must be followed by a non-linearity (5) or the network would be
equivalent to one convolutional layer.
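
Equation (6) applies one shared affine map to every window of columns. The sketch below illustrates this in plain Python, assuming an odd window size and a sequence stored as a list of column vectors; the function name and the explicit `padding_column` argument are illustrative choices.

```python
def convolution(W, b, columns, d_win, padding_column):
    """Equation (6): apply the same affine map to every window of d_win columns."""
    half = d_win // 2
    padded = [padding_column] * half + list(columns) + [padding_column] * half
    outputs = []
    for t in range(len(columns)):
        # Concatenate the d_win columns centred on position t into one vector.
        window = [x for column in padded[t:t + d_win] for x in column]
        outputs.append([sum(w * x for w, x in zip(row, window)) + b_i
                        for row, b_i in zip(W, b)])
    return outputs  # one output column per input position


# Toy sequence of three 1-dimensional columns and a layer with 2 output features.
columns = [[1.0], [2.0], [3.0]]
W = [[1, 1, 1],   # feature 0: sum over the window
     [0, 1, 0]]   # feature 1: copy of the centre word
b = [0, 0.5]
local_features = convolution(W, b, columns, d_win=3, padding_column=[0.0])
```

The output has as many columns as the sentence has words, which is why a further pooling step is needed before standard affine layers can be applied.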

Max Layer  The size of the output (6) depends on the number of words in the sentence
fed to the network. Local feature vectors extracted by the convolutional layers have to be
combined to obtain a global feature vector, with a fixed size independent of the sentence
length, in order to apply subsequent standard affine layers. Traditional convolutional
networks often apply an average (possibly weighted) or a max operation over the "time" $t$
of the sequence (6). (Here, "time" just means the position in the sentence; this term stems
from the use of convolutional layers in e.g. speech data where the sequence occurs over
time.) The average operation does not make much sense in our case, as in general most
words in the sentence do not have any influence on the semantic role of a given word to tag.
Instead, we used a max approach, which forces the network to capture the most useful local
features produced by the convolutional layers (see Figure 3), for the task at hand. Given a
matrix $f_\theta^{l-1}$ output by a convolutional layer $l-1$, the Max layer $l$ outputs a vector $f_\theta^l$:

$$\left[ f_\theta^l \right]_i = \max_t \left[ f_\theta^{l-1} \right]_{i,t} \qquad 1 \le i \le n_{hu}^{l-1}. \quad (7)$$

This fixed sized global feature vector can be then fed to standard affine network layers (4).
As in the window approach, we then finally produce one score per possible tag for the given
task.
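
The Max layer of equation (7) reduces a variable number of columns to one fixed-size vector. A minimal sketch, with the sequence stored as a list of column vectors and illustrative function names:

```python
def max_over_time(columns):
    """Equation (7): for each feature i, take the maximum over positions t."""
    return [max(values) for values in zip(*columns)]


def winning_positions(columns):
    """For each feature, the position t achieving the maximum (first one on ties)."""
    return [max(range(len(values)), key=values.__getitem__)
            for values in zip(*columns)]


# Three positions, two local features per position.
columns = [[3.0, 1.5], [6.0, 2.5], [5.0, 3.5]]
global_features = max_over_time(columns)
positions = winning_positions(columns)
```

Counting how often each position appears in `winning_positions` gives the kind of per-word histogram described in the caption of Figure 3.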

Scheme    Begin    Inside    End    Single    Other
IOB       B-X      I-X       I-X    B-X       O
IOE       I-X      I-X       E-X    E-X       O
IOBES     B-X      I-X       E-X    S-X       O

Table 3: Various tagging schemes. Each word in a segment labeled "X" is tagged with a
prefixed label, depending on the word position in the segment (begin, inside, end). Single
word segment labeling is also output. Words not in a labeled segment are labeled "O".
Variants of the IOB (and IOE) scheme exist, where the prefix B (or E) is replaced by I for
all segments not contiguous with another segment having the same label "X".



Remark 2  The same border effects arise in the convolution operation (6) as in the window
approach (3). We again work around this problem by padding the sentences with a special
word.



3.2.3 Tagging Schemes 

As explained earlier, the network output layers compute scores for all the possible tags for 
the task of interest. In the window approach, these tags apply to the word located in the 
center of the window. In the (convolutional) sentence approach, these tags apply to the 
word designated by additional markers in the network input. 

The POS task indeed consists of marking the syntactic role of each word. However, the
remaining three tasks associate labels with segments of a sentence. This is usually achieved
by using special tagging schemes to identify the segment boundaries, as shown in Table 3.
Several such schemes have been defined (IOB, IOE, IOBES, ...) without clear conclusion
as to which scheme is better in general. State-of-the-art performance is sometimes obtained
by combining classifiers trained with different tagging schemes (e.g. Kudo and Matsumoto,
2001).

The ground truth for the NER, CHUNK, and SRL tasks is provided using two different
tagging schemes. In order to eliminate this additional source of variations, we have decided
to use the most expressive IOBES tagging scheme for all tasks. For instance, in the CHUNK
task, we describe noun phrases using four different tags. Tag "S-NP" is used to mark a noun
phrase containing a single word. Otherwise tags "B-NP", "I-NP", and "E-NP" are used
to mark the first, intermediate and last words of the noun phrase. An additional tag "O"
marks words that are not members of a chunk. During testing, these tags are then converted
to the original IOB tagging scheme and fed to the standard performance evaluation scripts
mentioned in Section 2.
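The conversion between the IOBES and IOB schemes of Table 3 can be sketched as follows. This is a minimal illustration, assuming the IOB convention of Table 3 in which every segment starts with a B- prefix; the function names are hypothetical and the sketch does not reproduce the evaluation scripts themselves.

```python
from typing import List

# Prefix mapping read off Table 3: End behaves like Inside, Single like Begin.
_IOBES_TO_IOB = {"B": "B", "I": "I", "E": "I", "S": "B"}


def iobes_to_iob(tags: List[str]) -> List[str]:
    """Convert IOBES tags (e.g. 'S-NP') to IOB tags (e.g. 'B-NP')."""
    out = []
    for tag in tags:
        if tag == "O":
            out.append(tag)
        else:
            prefix, label = tag.split("-", 1)
            out.append(_IOBES_TO_IOB[prefix] + "-" + label)
    return out


def iob_to_iobes(tags: List[str]) -> List[str]:
    """Convert IOB tags (every segment starts with B-) to IOBES tags."""
    out = []
    for pos, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[pos + 1] if pos + 1 < len(tags) else "O"
        # The segment continues only if the next tag is an Inside tag with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:
            out.append(("I-" if continues else "E-") + label)
    return out


iob = ["B-NP", "I-NP", "I-NP", "O", "B-NP", "B-VP", "I-VP"]
iobes = ["B-NP", "I-NP", "E-NP", "O", "S-NP", "B-VP", "E-VP"]
print(iob_to_iobes(iob) == iobes, iobes_to_iob(iobes) == iob)
```

Because IOBES is the more expressive scheme, the IOBES to IOB direction is a simple prefix relabeling, while the reverse direction needs to look one tag ahead.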



3.3 Training 

All our neural networks are trained by maximizing a likelihood over the training data, using
stochastic gradient ascent. If we denote by $\theta$ all the trainable parameters of the network,
which are trained using a training set $\mathcal{T}$, we want to maximize the following log-likelihood
with respect to $\theta$:

$$\theta \mapsto \sum_{(x,\,y) \in \mathcal{T}} \log p(y \mid x, \theta), \tag{8}$$

where $x$ corresponds to either a training word window or a sentence and its associated
features, and $y$ represents the corresponding tag. The probability $p(\cdot)$ is computed from the
outputs of the neural network. We will see in this section two ways of interpreting neural
network outputs as probabilities.



3.3.1 Word-Level Log-Likelihood 

In this approach, each word in a sentence is considered independently. Given an input
example $x$, the network with parameters $\theta$ outputs a score $[f_\theta(x)]_i$ for the $i^{th}$ tag with
respect to the task of interest. To simplify the notation, we drop $x$ from now on, and we write
instead $[f_\theta]_i$. This score can be interpreted as a conditional tag probability $p(i \mid x, \theta)$ by
applying a softmax (Bridle, 1990) operation over all the tags:

$$p(i \mid x, \theta) = \frac{e^{[f_\theta]_i}}{\sum_j e^{[f_\theta]_j}}. \tag{9}$$

Defining the log-add operation as

$$\operatorname*{logadd}_i z_i = \log\Big(\sum_i e^{z_i}\Big), \tag{10}$$

we can express the log-likelihood for one training example $(x, y)$ as follows:

$$\log p(y \mid x, \theta) = [f_\theta]_y - \operatorname*{logadd}_j \, [f_\theta]_j. \tag{11}$$



While this training criterion, often referred to as cross-entropy, is widely used for classification
problems, it might not be ideal in our case, where there is often a correlation between the
tag of a word in a sentence and its neighboring tags. We now describe another common
approach for neural networks which enforces dependencies between the predicted tags in a
sentence.
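The word-level log-likelihood (11) can be sketched in a few lines of Python. The max-shift used inside `logadd` is a standard numerical-stability device and is an addition of this example, as are the function names; mathematically the result is the same as (10).

```python
import math
from typing import Sequence


def logadd(values: Sequence[float]) -> float:
    """log(sum_i exp(values[i])), computed after subtracting the maximum for stability."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def word_log_likelihood(scores: Sequence[float], y: int) -> float:
    """Equation (11): score of the true tag y minus the logadd over all tag scores."""
    return scores[y] - logadd(scores)


scores = [1.0, 2.0, 3.0]
probabilities = [math.exp(word_log_likelihood(scores, i)) for i in range(len(scores))]
print(probabilities, sum(probabilities))
```

Exponentiating (11) for every tag recovers the softmax probabilities (9), which sum to one.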



3.3.2 Sentence-Level Log-Likelihood 

In tasks like chunking, NER or SRL we know that there are dependencies between word 
tags in a sentence: not only are tags organized in chunks, but some tags cannot follow 
other tags. Training using a word-level approach discards this kind of labeling information. 
We consider a training scheme which takes into account the sentence structure: given the 
predictions of all tags by our network for all words in a sentence, and given a score for going 
from one tag to another tag, we want to encourage valid paths of tags during training, while 
discouraging all other paths. 

We consider the matrix of scores $f_\theta([x]_1^T)$ output by the network. As before, we drop the
input $[x]_1^T$ for notation simplification. The element $[f_\theta]_{i,t}$ of the matrix is the score output
by the network with parameters $\theta$, for the sentence $[x]_1^T$ and for the $i^{th}$ tag, at the $t^{th}$ word.
We introduce a transition score $[A]_{i,j}$ for jumping from $i$ to $j$ tags in successive words, and
an initial score $[A]_{i,0}$ for starting from the $i^{th}$ tag. As the transition scores are going to be
trained (as are all network parameters $\theta$), we define $\tilde{\theta} = \theta \cup \{[A]_{i,j} \ \forall i, j\}$. The score of
a sentence $[x]_1^T$ along a path of tags $[i]_1^T$ is then given by the sum of transition scores and
network scores:

$$s([x]_1^T, [i]_1^T, \tilde{\theta}) = \sum_{t=1}^{T} \left( [A]_{[i]_{t-1},\,[i]_t} + [f_\theta]_{[i]_t,\,t} \right). \tag{12}$$



Exactly as for the word-level likelihood (11), where we were normalizing with respect to all
tags using a softmax (9), we normalize this score over all possible tag paths $[j]_1^T$ using a
softmax, and we interpret the resulting ratio as a conditional tag path probability. Taking
the log, the conditional probability of the true path $[y]_1^T$ is therefore given by:

$$\log p([y]_1^T \mid [x]_1^T, \tilde{\theta}) = s([x]_1^T, [y]_1^T, \tilde{\theta}) - \operatorname*{logadd}_{\forall [j]_1^T} s([x]_1^T, [j]_1^T, \tilde{\theta}). \tag{13}$$



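While the number of terms in the logadd operation (11) was equal to the number of tags, it
grows exponentially with the length of the sentence in (13). Fortunately, one can compute
it in linear time with the following standard recursion over $t$, taking advantage of the
associativity and distributivity on the semi-ring⁹ $(\mathbb{R} \cup \{-\infty\}, \operatorname{logadd}, +)$:

$$
\begin{aligned}
\delta_t(k) &\triangleq \operatorname*{logadd}_{\{[j]_1^t \,\cap\, [j]_t = k\}} s([x]_1^t, [j]_1^t, \tilde{\theta}) \\
&= \operatorname*{logadd}_i \ \operatorname*{logadd}_{\{[j]_1^t \,\cap\, [j]_{t-1} = i \,\cap\, [j]_t = k\}} s([x]_1^{t-1}, [j]_1^{t-1}, \tilde{\theta}) + [A]_{[j]_{t-1},\,k} + [f_\theta]_{k,\,t} \\
&= \operatorname*{logadd}_i \ \delta_{t-1}(i) + [A]_{i,\,k} + [f_\theta]_{k,\,t} \\
&= [f_\theta]_{k,\,t} + \operatorname*{logadd}_i \left( \delta_{t-1}(i) + [A]_{i,\,k} \right) \qquad \forall k,
\end{aligned}
\tag{14}
$$

followed by the termination

$$\operatorname*{logadd}_{\forall [j]_1^T} s([x]_1^T, [j]_1^T, \tilde{\theta}) = \operatorname*{logadd}_i \, \delta_T(i). \tag{15}$$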
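The recursion (14) and the termination (15) can be sketched as follows and checked against an explicit enumeration of all tag paths on a tiny example. This is an illustrative sketch: the names are hypothetical, the score matrix is stored as `f[t][k]` (word first, tag second, the transpose of the $[f_\theta]_{k,t}$ notation), and the initial scores $[A]_{i,0}$ are passed as a separate list `A0`.

```python
import itertools
import math
from typing import List, Sequence


def logadd(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(f: List[List[float]], A: List[List[float]], A0: List[float],
               path: Sequence[int]) -> float:
    """Equation (12): sum of transition scores and network scores along a tag path."""
    score = A0[path[0]] + f[0][path[0]]
    for t in range(1, len(path)):
        score += A[path[t - 1]][path[t]] + f[t][path[t]]
    return score


def log_partition(f: List[List[float]], A: List[List[float]], A0: List[float]) -> float:
    """Recursion (14) and termination (15): logadd of the scores of all tag paths."""
    n_tags = len(A0)
    delta = [A0[k] + f[0][k] for k in range(n_tags)]
    for t in range(1, len(f)):
        delta = [f[t][k] + logadd([delta[i] + A[i][k] for i in range(n_tags)])
                 for k in range(n_tags)]
    return logadd(delta)


def sentence_log_likelihood(f, A, A0, y: Sequence[int]) -> float:
    """Equation (13): score of the true path minus the log-partition term."""
    return path_score(f, A, A0, y) - log_partition(f, A, A0)


# Tiny example with 3 words and 2 tags (arbitrary illustrative numbers).
f = [[1.0, 0.0], [0.0, 2.0], [1.5, 0.0]]
A = [[0.5, -1.0], [-0.5, 0.2]]
A0 = [0.0, 0.1]
all_paths = list(itertools.product(range(2), repeat=3))
brute_force = logadd([path_score(f, A, A0, p) for p in all_paths])
print(log_partition(f, A, A0), brute_force)
```

The enumeration visits every path and is exponential in the sentence length, whereas the recursion costs a number of operations proportional to the sentence length times the square of the number of tags.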



We can now maximize in (8) the log-likelihood (13) over all the training pairs $([x]_1^T, [y]_1^T)$.
At inference time, given a sentence $[x]_1^T$ to tag, we have to find the best tag path, which
maximizes the sentence score (12). In other words, we must find

$$\operatorname*{argmax}_{[j]_1^T} \ s([x]_1^T, [j]_1^T, \tilde{\theta}). \tag{16}$$

The Viterbi algorithm is the natural choice for this inference. It corresponds to performing
the recursion (14) and (15), but where the logadd is replaced by a max, and then tracking
back the optimal path through each max.
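A minimal sketch of this Viterbi decoding, assuming the same conventions as the rest of this section but with hypothetical names: `f[t][k]` is the network score of tag `k` at word `t`, `A[i][k]` the transition score from tag `i` to tag `k`, and `A0[k]` the initial score of tag `k`.

```python
from typing import List, Sequence, Tuple


def path_score(f: List[List[float]], A: List[List[float]], A0: List[float],
               path: Sequence[int]) -> float:
    """Equation (12): sum of transition scores and network scores along a tag path."""
    score = A0[path[0]] + f[0][path[0]]
    for t in range(1, len(path)):
        score += A[path[t - 1]][path[t]] + f[t][path[t]]
    return score


def viterbi(f: List[List[float]], A: List[List[float]],
            A0: List[float]) -> Tuple[List[int], float]:
    """Recursion (14)-(15) with logadd replaced by max, followed by backtracking."""
    n_tags = len(A0)
    delta = [A0[k] + f[0][k] for k in range(n_tags)]
    backpointers: List[List[int]] = []
    for t in range(1, len(f)):
        new_delta, pointers = [], []
        for k in range(n_tags):
            # Best previous tag for reaching tag k at word t.
            best_i = max(range(n_tags), key=lambda i: delta[i] + A[i][k])
            pointers.append(best_i)
            new_delta.append(f[t][k] + delta[best_i] + A[best_i][k])
        delta = new_delta
        backpointers.append(pointers)
    last = max(range(n_tags), key=lambda k: delta[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


# Tiny example with 3 words and 2 tags (arbitrary illustrative numbers).
f = [[1.0, 0.0], [0.0, 2.0], [1.5, 0.0]]
A = [[0.5, -1.0], [-0.5, 0.2]]
A0 = [0.0, 0.1]
best_path, best_score = viterbi(f, A, A0)
print(best_path, best_score)
```

On such a small example the result can be verified by scoring all eight tag paths with `path_score` and taking the maximum.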
9. In other words, read logadd as $\oplus$ and + as $\otimes$.

Remark 3 (Graph Transformer Networks) Our approach is a particular case of the
discriminative forward training for graph transformer networks (GTNs) (Bottou et al., 1991;
Le Cun et al., 1998). The log-likelihood (13) can be viewed as the difference between the
forward score constrained over the valid paths (in our case there is only the labeled path)
and the unconstrained forward score (15).



Remark 4 (Conditional Random Fields) An important feature of equation (12) is the
absence of normalization. Summing the exponentials $e^{[A]_{i,j}}$ over all possible tags $j$ does
not necessarily yield the unity. If this were the case, the scores could be viewed as the
logarithms of conditional transition probabilities, and our model would be subject to the
label-bias problem that motivates Conditional Random Fields (CRFs) (Lafferty et al., 2001).
The denormalized scores should instead be likened to the potential functions of a CRF.
In fact, a CRF maximizes the same likelihood (13) using a linear model instead of a
nonlinear neural network. CRFs have been widely used in the NLP world, such as for POS
tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), NER (McCallum and Li,
2003) or SRL (Cohn and Blunsom, 2005). Compared to such CRFs, we take advantage of
the nonlinear network to learn appropriate features for each task of interest.



3.3.3 Stochastic Gradient 



Maximizing (8) with stochastic gradient (Bottou, 1991) is achieved by iteratively selecting
a random example $(x, y)$ and making a gradient step:

$$\theta \longleftarrow \theta + \lambda \, \frac{\partial \log p(y \mid x, \theta)}{\partial \theta}, \tag{17}$$

where $\lambda$ is a chosen learning rate. Our neural networks described in Figure 1 and Figure 2
are a succession of layers that correspond to successive composition of functions. The neural
network is finally composed with the word-level log-likelihood (11), or successively composed
in the recursion (14) if using the sentence-level log-likelihood (13). Thus, an analytical
formulation of the derivative (17) can be computed, by applying the differentiation chain
rule through the network, and through the word-level log-likelihood (11) or through the
recurrence (14).
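To make the update (17) concrete, here is a minimal sketch of one stochastic gradient ascent step on the word-level log-likelihood (11). As a simplifying assumption of this example, a single linear scorer (scores = W x) stands in for the full network; all names are hypothetical. For this scorer, the chain rule gives the derivative of (11) with respect to W[j][d] as (1 if j = y else 0 minus p(j | x)) times x[d].

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Equation (9), computed after subtracting the maximum for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def tag_scores(W: List[List[float]], x: List[float]) -> List[float]:
    """Toy linear scorer standing in for the network: one score per tag."""
    return [sum(w * xd for w, xd in zip(row, x)) for row in W]


def log_likelihood(W: List[List[float]], x: List[float], y: int) -> float:
    """Word-level log-likelihood (11) of the true tag y."""
    return math.log(softmax(tag_scores(W, x))[y])


def sgd_ascent_step(W: List[List[float]], x: List[float], y: int,
                    lr: float) -> List[List[float]]:
    """Update (17): move the parameters along the gradient of the log-likelihood."""
    p = softmax(tag_scores(W, x))
    return [[w + lr * ((1.0 if j == y else 0.0) - p[j]) * xd
             for w, xd in zip(row, x)]
            for j, row in enumerate(W)]


W = [[0.0, 0.0], [0.0, 0.0]]
x, y = [1.0, -2.0], 1
before = log_likelihood(W, x, y)
after = log_likelihood(sgd_ascent_step(W, x, y, lr=0.1), x, y)
print(before, after)
```

With a small learning rate, the log-likelihood of the selected example increases after the step, as expected from a gradient ascent update.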



Remark 5 (Differentiability) Our cost functions are differentiable almost everywhere.
Non-differentiable points arise because we use a "hard" transfer function and because
we use a "max" layer (7) in the sentence approach network. Fortunately, stochastic
gradient still converges to a meaningful local minimum despite such minor differentiability
problems (Bottou, 1991, 1998). Stochastic gradient iterations that hit a non-differentiability
are simply skipped.



Remark 6 (Modular Approach) The well known "back-propagation" algorithm (LeCun,
1985; Rumelhart et al., 1986) computes gradients using the chain rule. The chain rule can
also be used in a modular implementation.¹⁰ Our modules correspond to the boxes in Figure 1
and Figure 2. Given derivatives with respect to its outputs, each module can independently



10. See http://torch5.sf.net

Approach            POS     Chunking   NER     SRL
                    (PWA)   (F1)       (F1)    (F1)
Benchmark Systems   97.24   94.29      89.31   77.92
NN+WLL              96.31   89.13      79.53   55.40
NN+SLL              96.37   90.33      81.47   70.99

Table 4: Comparison in generalization performance of benchmark NLP systems with a
vanilla neural network (NN) approach, on POS, chunking, NER and SRL tasks. We report
results with both the word-level log-likelihood (WLL) and the sentence-level log-likelihood
(SLL). Generalization performance is reported in per-word accuracy rate (PWA) for POS
and F1 score for other tasks. The NN results are behind the benchmark results; in Section 4
we show how to improve these models using unlabeled data.



Task    Window/Conv. size   Word dim.    Caps dim.   Hidden units                         Learning rate
POS     $d_{win} = 5$       $d^0 = 50$   $d^1 = 5$   $n_{hu}^1 = 300$                     $\lambda = 0.01$
CHUNK   "                   "            "           "                                    "
NER     "                   "            "           "                                    "
SRL     "                   "            "           $n_{hu}^1 = 300$, $n_{hu}^2 = 500$   "

Table 5: Hyper-parameters of our networks. We report for each task the window size
(or convolution size), word feature dimension, capital feature dimension, number of hidden
units and learning rate. A ditto mark (") repeats the value of the row above.




compute derivatives with respect to its inputs and with respect to its trainable parameters,
as proposed by Bottou and Gallinari (1991). This allows us to easily build variants of our
networks. For details about gradient computations, see Appendix A.



Remark 7 (Tricks) Many tricks have been reported for training neural networks (LeCun
et al., 1998). Which ones to choose is often confusing. We employed only two of them: the
initialization and update of the parameters of each network layer were done according to
the "fan-in" of the layer, that is the number of inputs used to compute each output of this
layer (Plaut and Hinton, 1981). The fan-in for the lookup table, the $l^{th}$ linear layer
and the convolution layer are respectively $1$, $n_{hu}^{l-1}$ and $d_{win} \times n_{hu}^{l-1}$. The initial parameters
of the network were drawn from a centered uniform distribution, with a variance equal to
the inverse of the square-root of the fan-in. The learning rate in (17) was divided by the
fan-in, but stays fixed during the training.
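The two tricks of Remark 7 can be sketched as follows. The sketch takes the wording of the remark literally (variance equal to the inverse of the square root of the fan-in) and uses the fact that a uniform distribution on [-a, a] has variance a²/3; the function names and the example fan-in are assumptions made for illustration.

```python
import math
import random
from typing import List


def init_bound(fan_in: int) -> float:
    """Half-width a of a centered uniform distribution with variance 1 / sqrt(fan_in)."""
    variance = 1.0 / math.sqrt(fan_in)
    return math.sqrt(3.0 * variance)  # uniform on [-a, a] has variance a**2 / 3


def init_weights(n_out: int, fan_in: int, rng: random.Random) -> List[List[float]]:
    """Draw an n_out x fan_in weight matrix from the centered uniform distribution."""
    a = init_bound(fan_in)
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(n_out)]


def layer_learning_rate(base_lr: float, fan_in: int) -> float:
    """Per-layer learning rate: the global rate divided by the fan-in of the layer."""
    return base_lr / fan_in


rng = random.Random(0)
fan_in = 5 * 55  # hypothetical example: a window of 5 words with 55 features each
weights = init_weights(n_out=300, fan_in=fan_in, rng=rng)
print(init_bound(fan_in), layer_learning_rate(0.01, fan_in))
```

Layers with a larger fan-in therefore start with smaller weights and take smaller steps.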



3.4 Supervised Benchmark Results 

For POS, chunking and NER tasks, we report results with the window architecture described
in Section 3.2.1. The SRL task was trained using the sentence approach (Section 3.2.2).
Results are reported in Table 4, in per-word accuracy (PWA) for POS, and F1 score for all

FRANCE       JESUS         XBOX       REDDISH       SCRATCHED     MEGABITS
454          1973          6909       11724         29869         87025
PERSUADE     THICKETS      DECADENT   WIDESCREEN    ODD           PPA
FAW          SAVARY        DIVO       ANTICA        ANCHIETA      UDDIN
BLACKSTOCK   SYMPATHETIC   VERUS      SHABBY        EMIGRATION    BIOLOGICALLY
GIORGI       JFK           OXIDE      AWE           MARKING       KAYAK
SHAHEED      KHWARAZM      URBINA     THUD          HEUER         MCLARENS
RUMELIA      STATIONERY    EPOS       OCCUPANT      SAMBHAJI      GLADWIN
PLANUM       ILIAS         EGLINTON   REVISED       WORSHIPPERS   CENTRALLY
GOA'ULD      GSNUMBER      EDGING     LEAVENED      RITSUKO       INDONESIA
COLLATION    OPERATOR      FRG        PANDIONIDAE   LIFELESS      MONEO
BACHA        W.J.          NAMSOS     SHIRT         MAHAN         NILGIRIS

Table 6: Word embeddings in the word lookup table of a SRL neural network trained from
scratch, with a dictionary of size 100,000. For each column the queried word is followed by
its index in the dictionary (higher means more rare) and its 10 nearest neighbors (using
the Euclidean metric, which was chosen arbitrarily).



the other tasks. We performed experiments both with the word-level log-likelihood (WLL)
and with the sentence-level log-likelihood (SLL). The hyper-parameters of our networks are
reported in Table 5. All our networks were fed with two raw text features: lower case words,
and a capital letter feature. We chose to consider lower case words to limit the number
of words in the dictionary. However, to keep some upper case information lost by this
transformation, we added a "caps" feature which tells if each word was in low caps, was all
caps, had first letter capital, or had one capital. Additionally, all occurrences of sequences
of numbers within a word are replaced with the string "NUMBER", so for example both the
words "PS1" and "PS2" would map to the single word "psNUMBER". We used a dictionary
containing the 100,000 most common words in WSJ (case insensitive). Words outside this
dictionary were replaced by a single special "RARE" word.
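The preprocessing described above can be sketched as follows. The names of the four caps categories, the order in which they are tested, and the tiny dictionary are assumptions of this example; only the lower-casing, the "NUMBER" substitution and the "RARE" replacement come from the description above.

```python
import re
from typing import Set


def caps_feature(word: str) -> str:
    """Return one of four capitalization categories (category names are hypothetical)."""
    if not any(c.isupper() for c in word):
        return "lowercase"
    if word.isupper():
        return "allcaps"
    if word[0].isupper():
        return "initcap"
    return "hascap"


def normalize_word(word: str, dictionary: Set[str]) -> str:
    """Lower-case the word, replace digit sequences by NUMBER, map unknown words to RARE."""
    normalized = re.sub(r"\d+", "NUMBER", word.lower())
    return normalized if normalized in dictionary else "RARE"


dictionary = {"psNUMBER", "france", "the"}  # toy stand-in for the 100,000-word dictionary
for token in ["PS1", "PS2", "France", "iPod"]:
    print(token, normalize_word(token, dictionary), caps_feature(token))
```

Both "PS1" and "PS2" map to the dictionary entry "psNUMBER", while "iPod" falls outside the toy dictionary and becomes "RARE"; the caps feature keeps the capitalization information that lower-casing discards.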

Results show that neural networks "out-of-the-box" are behind baseline benchmark
systems. Looking at all submitted systems reported on each CoNLL challenge website
showed us that our networks' performance is nevertheless in the performance ballpark of existing
approaches. The training criterion which takes into account the sentence structure (SLL)
seems to boost the performance for the Chunking, NER and SRL tasks, with little advantage
for POS. This result is in line with existing NLP studies comparing sentence-level and
word-level likelihoods (Liang et al., 2008). The capacity of our network architectures lies mainly
in the word lookup table, which contains 50 x 100,000 parameters to train. In the WSJ data,
15% of the most common words appear about 90% of the time. Many words appear only
a few times. It is thus very difficult to properly train their corresponding 50-dimensional
feature vectors in the lookup table. Ideally, we would like semantically similar words to be
close in the embedding space represented by the word lookup table: by continuity of the
neural network function, tags produced on semantically similar sentences would be similar.
We show in Table 6 that it is not the case: neighboring words in the embedding space do
not seem to be semantically related.

(a) POS   (b) CHUNK   (c) NER   (d) SRL

Figure 4: F1 score on the validation set (y-axis) versus number of hidden units (x-axis)
for different tasks trained with the sentence-level likelihood (SLL), as in Table 4. For SRL,
we vary in this graph only the number of hidden units in the second layer. The scale is
adapted for each task. We show the standard deviation (obtained over 5 runs with different
random initialization), for the architecture we picked (300 hidden units for POS, CHUNK
and NER, 500 for SRL).



We will focus in the next section on improving these word embeddings by leveraging
unlabeled data. We will see that our approach results in a performance boost for all tasks.

Remark 8 (Architectures) In all our experiments in this paper, we tuned the
hyper-parameters by trying only a few different architectures by validation. In practice, the choice
of hyper-parameters such as the number of hidden units, provided they are large enough, has
a limited impact on the generalization performance. In Figure 4, we report the F1 score
for each task on the validation set, with respect to the number of hidden units. Considering
the variance related to the network initialization, we chose the smallest network achieving
"reasonable" performance, rather than picking the network achieving the top performance
obtained on a single run.



Remark 9 (Training Time) Training our network is quite computationally expensive.
Chunking and NER take about one hour to train, POS takes a few hours, and SRL takes
about three days. Training could be faster with a larger learning rate, but we preferred to
stick to a small one which works, rather than finding the optimal one for speed. Second
order methods (LeCun et al., 1998) could be another speedup technique.



4. Lots of Unlabeled Data 

We would like to obtain word embeddings carrying more syntactic and semantic information
than shown in Table 6. Since most of the trainable parameters of our system are associated
with the word embeddings, these poor results suggest that we should use considerably 
more training data. Following our NLP from scratch philosophy, we now describe how 
to dramatically improve these embeddings using large unlabeled datasets. We then use 
these improved embeddings to initialize the word lookup tables of the networks described 
in Section

4.1 Datasets

Our first English corpus is the entire English Wikipedia.¹¹ We have removed all paragraphs
containing non-roman characters and all MediaWiki markups. The resulting text was
tokenized using the Penn Treebank tokenizer script.¹² The resulting dataset contains about
631 million words. As in our previous experiments, we use a dictionary containing the
100,000 most common words in WSJ, with the same processing of capitals and numbers.
Again, words outside the dictionary were replaced by the special "RARE" word.

Our second English corpus is composed by adding an extra 221 million words extracted
from the Reuters RCV1 (Lewis et al., 2004) dataset.¹³ We also extended the dictionary to
130,000 words by adding the 30,000 most common words in Reuters. This is useful in order
to determine whether improvements can be achieved by further increasing the unlabeled
dataset size.



4.2 Ranking Criterion versus Entropy Criterion 



We used these unlabeled datasets to train language models that compute scores describing
the acceptability of a piece of text. These language models are again large neural networks
using the window approach described in Section 3.2.1 and in Figure 1. As in the previous
section, most of the trainable parameters are located in the lookup tables.

Similar language models were already proposed by Bengio and Ducharme (2001) and
Schwenk and Gauvain (2002). Their goal was to estimate the probability of a word given
the previous words in a sentence. Estimating conditional probabilities suggests a
cross-entropy criterion similar to those described in Section 3.3.1. Because the dictionary size is
large, computing the normalization term can be extremely demanding, and sophisticated
approximations are required. More importantly for us, neither work leads to significant
word embeddings being reported.



Shannon (1951) has estimated the entropy of the English language between 0.6 and 1.3
bits per character by asking human subjects to guess upcoming characters. Cover and King
(1978) give a lower bound of 1.25 bits per character using a subtle gambling approach.
Meanwhile, using a simple word trigram model, Brown et al. (1992b) reach 1.75 bits per
character. Teahan and Cleary (1996) obtain entropies as low as 1.46 bits per character
using variable length character n-grams. The human subjects rely of course on all their
knowledge of the language and of the world. Can we learn the grammatical structure of the
English language and the nature of the world by leveraging the 0.2 bits per character that
separate human subjects from simple n-gram models? Since such tasks certainly require
high capacity models, obtaining sufficiently small confidence intervals on the test set entropy
may require prohibitively large training sets.¹⁴ The entropy criterion lacks dynamical range
because its numerical value is largely determined by the most frequent phrases. In order to
learn syntax, rare but legal phrases are no less significant than common phrases.



11. Available at http://download.wikimedia.org. We took the November 2007 version.

12. Available at http://www.cis.upenn.edu/~treebank/tokenization.html

13. Now available at http://trec.nist.gov/data/reuters/reuters.html

14. However, Klein and Manning (2002) describe a rare example of realistic unsupervised grammar induction
using a cross-entropy approach on binary-branching parsing trees, that is, by forcing the system to
generate a hierarchical representation.

It is therefore desirable to define alternative training criteria. We propose here to use a
pairwise ranking approach (Cohen et al., 1998). We seek a network that computes a higher
score when given a legal phrase than when given an incorrect phrase. Because the ranking
literature often deals with information retrieval applications, many authors define complex
ranking criteria that give more weight to the ordering of the best ranking instances (see
Burges et al., 2007; Clémençon and Vayatis, 2007). However, in our case, we do not want
to emphasize the most common phrase over the rare but legal phrases. Therefore we use a
simple pairwise criterion.

We consider a window approach network, as described in Section 3.2.1 and Figure 1,
with parameters $\theta$ which outputs a score $f_\theta(x)$ given a window of text $x = [w]_1^{d_{win}}$. We
minimize the ranking criterion with respect to $\theta$:

$$\theta \mapsto \sum_{x \in \mathcal{X}} \sum_{w \in \mathcal{D}} \max\left\{ 0, \ 1 - f_\theta(x) + f_\theta(x^{(w)}) \right\}, \tag{18}$$

where $\mathcal{X}$ is the set of all possible text windows with $d_{win}$ words coming from our training
corpus, $\mathcal{D}$ is the dictionary of words, and $x^{(w)}$ denotes the text window obtained by replacing
the central word of text window $[w]_1^{d_{win}}$ by the word $w$.
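One term of the ranking criterion (18) can be sketched as follows. The scorer `toy_score` is a purely hypothetical stand-in for the network $f_\theta$, used only so that the example runs; the other names are likewise assumptions of this sketch.

```python
from typing import Callable, List, Sequence


def corrupt_center(window: Sequence[str], word: str) -> List[str]:
    """Build x^(w): a copy of the window whose central word is replaced by `word`."""
    corrupted = list(window)
    corrupted[len(corrupted) // 2] = word
    return corrupted


def pairwise_ranking_loss(score: Callable[[Sequence[str]], float],
                          window: Sequence[str], word: str) -> float:
    """One term of (18): hinge loss asking f(x) to exceed f(x^(w)) by a margin of 1."""
    return max(0.0, 1.0 - score(window) + score(corrupt_center(window, word)))


def toy_score(window: Sequence[str]) -> float:
    """Hypothetical scorer: rewards one known legal window and nothing else."""
    return 2.0 if list(window) == ["the", "cat", "sat", "on", "mats"] else 0.0


legal_window = ["the", "cat", "sat", "on", "mats"]
print(pairwise_ranking_loss(toy_score, legal_window, "dog"))            # margin satisfied
print(pairwise_ranking_loss(lambda w: 0.0, legal_window, "dog"))        # margin violated
```

The loss vanishes as soon as the legal window outscores the corrupted one by at least the margin, so a training step only changes the parameters for pairs that still violate the margin.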



Okanohara and Tsujii (2007) use a related approach to avoid the entropy criterion,
using a binary classification approach (correct/incorrect phrase). Their work focuses on
using a kernel classifier, and not on learning word embeddings as we do here. Smith and
Eisner (2005) also propose a contrastive criterion which estimates the likelihood of the data
conditioned to a "negative" neighborhood. They consider various data neighborhoods,
including sentences of length $d_{win}$ drawn from $\mathcal{D}^{d_{win}}$. Their goal was however to perform
well on some tagging task on fully unsupervised data, rather than obtaining generic word
embeddings useful for other tasks.



4.3 Training Language Models 

The language model network was trained by stochastic gradient minimization of the ranking
criterion (18), sampling a sentence-word pair $(s, w)$ at each iteration.



Since training times for such large scale systems are counted in weeks, it is not feasible 
to try many combinations of hyperparameters. It also makes sense to speed up the training 
time by initializing new networks with the embeddings computed by earlier networks. In 
particular, we found it expedient to train a succession of networks using increasingly large 
dictionaries, each network being initialized with the embeddings of the previous network. 



Successive dictionary sizes and switching times are chosen arbitrarily. Bengio et al. (2009)
provide a more detailed discussion of this (as yet, poorly understood) "curriculum"
process.

For the purposes of model selection we use the process of "breeding". The idea of
breeding is, instead of trying a full grid search of possible values (which we did not have
enough computing power for), to search for the parameters in analogy to breeding biological
cell lines. Within each line, child networks are initialized with the embeddings of their
parents and trained on increasingly rich datasets with sometimes different parameters. That
is, suppose we have $k$ processors, which is much less than the possible set of parameters
one would like to try. One chooses $k$ initial parameter choices from the large set, and trains
these on the $k$ processors. In our case, possible parameters to adjust are: the learning rate
$\lambda$, the word embedding dimensions $d$, number of hidden units $n_{hu}^1$ and input window size
$d_{win}$. One then trains each of these models in an online fashion for a certain amount of
time (i.e. a few days), and then selects the best ones using the validation set error rate.
That is, breeding decisions were made on the basis of the value of the ranking criterion (18)
estimated on a validation set composed of one million words held out from the Wikipedia
corpus. In the next breeding iteration, one then chooses another set of $k$ parameters from
the possible grid of values that permute slightly the most successful candidates from the
previous round. As many of these parameter choices can share weights, we can effectively
continue online training retaining some of the learning from the previous iterations.
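The breeding loop can be sketched as below. This is only a schematic illustration under strong simplifications: the functions `toy_train`, `toy_validate` and `toy_mutate` are hypothetical stand-ins (a model state is reduced to a single validation value), surviving candidates are kept unchanged rather than trained further, and only the learning rate is mutated.

```python
import random
from typing import Any, Callable, Dict, List, Optional, Tuple

Params = Dict[str, float]
Candidate = Tuple[Params, Any]


def breed(initial_params: List[Params],
          train: Callable[[Params, Optional[Any]], Any],
          validate: Callable[[Any], float],
          mutate: Callable[[Params, random.Random], Params],
          rounds: int, keep: int, rng: random.Random) -> Candidate:
    """Keep the best `keep` candidates each round and refill the k slots with their children."""
    k = len(initial_params)
    population = [(p, train(p, None)) for p in initial_params]
    for _ in range(rounds):
        population.sort(key=lambda cand: validate(cand[1]))  # lower criterion is better
        survivors = population[:keep]
        children = []
        for slot in range(k - keep):
            parent_params, parent_state = survivors[slot % keep]
            child_params = mutate(parent_params, rng)
            # A child starts from its parent's state (for example its embeddings).
            children.append((child_params, train(child_params, parent_state)))
        population = survivors + children
    return min(population, key=lambda cand: validate(cand[1]))


def toy_train(params: Params, parent_state: Optional[float]) -> float:
    """Hypothetical training: the state is a validation value that shrinks near lr = 0.01."""
    start = 10.0 if parent_state is None else parent_state
    return start * (0.5 + 10.0 * abs(params["lr"] - 0.01))


def toy_validate(state: float) -> float:
    return state


def toy_mutate(params: Params, rng: random.Random) -> Params:
    return {"lr": params["lr"] * rng.choice([0.5, 1.0, 2.0])}


initial = [{"lr": 0.1}, {"lr": 0.001}, {"lr": 0.05}, {"lr": 0.02}]
best_params, best_state = breed(initial, toy_train, toy_validate, toy_mutate,
                                rounds=3, keep=2, rng=random.Random(0))
print(best_params, best_state)
```

Because the survivors are carried over unchanged in this sketch, the best validation value can never get worse from one round to the next.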

Very long training times make such strategies necessary for the foreseeable future: if we
had been given computers ten times faster, we probably would have found uses for datasets
ten times bigger. However, we should say we believe that although we ended up with a
particular choice of parameters, many other choices are almost as good, although
perhaps there are others that are better as we could not do a full grid search.

In the following subsections, we report results obtained with two trained language
models. The results achieved by these two models are representative of those achieved
by networks trained on the full corpora.

• Language model LM1 has a window size $d_{win} = 11$ and a hidden layer with $n_{hu}^1 = 100$
units. The embedding layers were dimensioned like those of the supervised networks
(Table 5). Model LM1 was trained on our first English corpus (Wikipedia) using
successive dictionaries composed of the 5,000, 10,000, 30,000, 50,000 and finally
100,000 most common WSJ words. The total training time was about four weeks.

• Language model LM2 has the same dimensions. It was initialized with the embeddings
of LM1, and trained for an additional three weeks on our second English corpus
(Wikipedia+Reuters) using a dictionary size of 130,000 words.

4.4 Embeddings 



Both networks produce much more appealing word embeddings than in Section 3.4. Table 7
shows the ten nearest neighbors of a few randomly chosen query words for the LM1 model.
The syntactic and semantic properties of the neighbors are clearly related to those of the
query word. These results are far more satisfactory than those reported in Table 6 for
embeddings obtained using purely supervised training of the benchmark NLP tasks.
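Nearest-neighbor queries of this kind can be sketched as follows, using the Euclidean metric. The two-dimensional vectors below are invented for illustration and are not taken from any trained lookup table; the function name is hypothetical.

```python
import math
from typing import Dict, List


def nearest_neighbors(embeddings: Dict[str, List[float]], query: str, k: int) -> List[str]:
    """Return the k words whose embeddings are closest to the query word (Euclidean metric)."""
    q = embeddings[query]
    distances = [(math.dist(q, vector), word)
                 for word, vector in embeddings.items() if word != query]
    distances.sort()
    return [word for _, word in distances[:k]]


# Invented toy vectors; a real lookup table would hold 50-dimensional embeddings.
toy_embeddings = {
    "france": [1.0, 0.0],
    "austria": [0.9, 0.1],
    "xbox": [-1.0, 0.5],
    "amiga": [-0.9, 0.6],
}
print(nearest_neighbors(toy_embeddings, "france", 1))
print(nearest_neighbors(toy_embeddings, "xbox", 2))
```

For a dictionary of 100,000 words this exhaustive scan is still cheap enough for occasional queries.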

4.5 Semi-supervised Benchmark Results 

Semi-supervised learning has been the object of much attention during the last few years (see
Chapelle et al., 2006). Previous semi-supervised approaches for NLP can be roughly
categorized as follows:

• Ad-hoc approaches such as (Rosenfeld and Feldman, 2007) for relation extraction.

• Self-training approaches, such as (Ueffing et al., 2007) for machine translation,
and (McClosky et al., 2006) for parsing. These methods augment the labeled training

FRANCE        JESUS     XBOX          REDDISH     SCRATCHED   MEGABITS
454           1973      6909          11724       29869       87025
AUSTRIA       GOD       AMIGA         GREENISH    NAILED      OCTETS
BELGIUM       SATI      PLAYSTATION   BLUISH      SMASHED     MB/S
GERMANY       CHRIST    MSX           PINKISH     PUNCHED     BIT/S
ITALY         SATAN     IPOD          PURPLISH    POPPED      BAUD
GREECE        KALI      SEGA          BROWNISH    CRIMPED     CARATS
SWEDEN        INDRA     PSNUMBER      GREYISH     SCRAPED     KBIT/S
NORWAY        VISHNU    HD            GRAYISH     SCREWED     MEGAHERTZ
EUROPE        ANANDA    DREAMCAST     WHITISH     SECTIONED   MEGAPIXELS
HUNGARY       PARVATI   GEFORCE       SILVERY     SLASHED     GBIT/S
SWITZERLAND   GRACE     CAPCOM        YELLOWISH   RIPPED      AMPERES

Table 7: Word embeddings in the word lookup table of the language model neural network
LM1 trained with a dictionary of size 100,000. For each column the queried word is followed
by its index in the dictionary (higher means more rare) and its 10 nearest neighbors (using
the Euclidean metric, which was chosen arbitrarily).



set with examples from the unlabeled dataset using the labels predicted by the model
itself. Transductive approaches, such as (Joachims, 1999) for text classification, can
be viewed as a refined form of self-training.

• Parameter sharing approaches such as (Ando and Zhang, 2005; Suzuki and Isozaki,
2008). Ando and Zhang propose a multi-task approach where they jointly train
models sharing certain parameters. They train POS and NER models together with a
language model (trained on 15 million words) consisting of predicting words given the
surrounding tokens. Suzuki and Isozaki embed a generative model (Hidden Markov
Model) inside a CRF for POS, Chunking and NER. The generative model is trained
on one billion words. These approaches should be seen as a linear counterpart of our
work. Using multilayer models vastly expands the parameter sharing opportunities
(see Section 5).

Our approach simply consists of initializing the word lookup tables of the supervised
networks with the embeddings computed by the language models. Supervised training is
then performed as in Section 3.4. In particular the supervised training stage is free to
modify the lookup tables. This sequential approach is computationally convenient because
it separates the lengthy training of the language models from the relatively fast training of
the supervised networks. Once the language models are trained, we can perform multiple
experiments on the supervised networks in a relatively short time. Note that our procedure
is clearly linked to the (semi-supervised) deep learning procedures of (Hinton et al., 2006;
Bengio et al., 2007; Weston et al., 2008).

Table 8 clearly shows that this simple initialization significantly boosts the generalization
performance of the supervised networks for each task. It is worth mentioning the larger
language model led to even better performance. This suggests that we could still take
advantage of even bigger unlabeled datasets.

Approach            POS     Chunking   NER     SRL
                    (PWA)   (F1)       (F1)    (F1)
Benchmark Systems   97.24   94.29      89.31   77.92
NN+WLL              96.31   89.13      79.53   55.40
NN+SLL              96.37   90.33      81.47   70.99
NN+WLL+LM1          97.05   91.91      85.68   58.18
NN+SLL+LM1          97.10   93.65      87.58   73.84
NN+WLL+LM2          97.14   92.04      86.96   58.34
NN+SLL+LM2          97.20   93.63      88.67   74.15

Table 8: Comparison in generalization performance of benchmark NLP systems with our
(NN) approach on POS, chunking, NER and SRL tasks. We report results with both the
word-level log-likelihood (WLL) and the sentence-level log-likelihood (SLL). We report with
(LMn) performance of the networks trained from the language model embeddings (Table 7).
Generalization performance is reported in per-word accuracy (PWA) for POS and F1 score
for other tasks.



4.6 Ranking and Language

There is broad agreement in the NLP community that syntax is a necessary prerequisite for
semantic role labeling (Gildea and Palmer, 2002). This is why state-of-the-art semantic role
labeling systems thoroughly exploit multiple parse trees. The parsers themselves (Charniak,
2000; Collins, 1999) contain considerable prior information about syntax (one can think of
this as a kind of informed pre-processing).

Our system does not use such parse trees because we attempt to learn this information
from the unlabeled data set. It is therefore legitimate to question whether our ranking
criterion (18) has the conceptual capability to capture such rich hierarchical information.



At first glance, the ranking task appears unrelated to the induction of probabilistic
grammars that underlie standard parsing algorithms. The lack of hierarchical representation
seems a fatal flaw (Chomsky, 1956).

However, ranking is closely related to an alternative description of the language
structure: operator grammars (Harris, 1968). Instead of directly studying the structure
of a sentence, Harris defines an algebraic structure on the space of all sentences. Starting
from a couple of elementary sentence forms, sentences are described by the successive
application of sentence transformation operators. The sentence structure is revealed as
a side effect of the successive transformations. Sentence transformations can also have a
semantic interpretation.



In the spirit of structural linguistics, Harris describes procedures to discover sentence
transformation operators by leveraging the statistical regularities of the language. Such
procedures are obviously useful for machine learning approaches. In particular, he proposes
a test to decide whether two sentence forms are semantically related by a transformation
operator. He first defines a ranking criterion (Harris, 1968, section 4.1):



"Starting for convenience with very short sentence forms, say ABC, we 
choose a particular word choice for all the classes, say B_q C_q, except one, in
this case A; for every pair of members A_i, A_j of that word class we ask how
the sentence formed with one of the members, i.e. A_i B_q C_q compares as to
acceptability with the sentence formed with the other member, i.e. A_j B_q C_q."

These gradings are then used to compare sentence forms: 

"It now turns out that, given the graded n-tuples of words for a particular 
sentence form, we can find other sentences forms of the same word classes in 
which the same n-tuples of words produce the same grading of sentences." 

This is an indication that these two sentence forms exploit common words with the same
syntactic function and possibly the same meaning. This observation forms the empirical
basis for the construction of operator grammars that describe real-world natural languages
such as English.



Therefore there are solid reasons to believe that the ranking criterion (18) has the
conceptual potential to capture strong syntactic and semantic information. On the other
hand, the structure of our language models is probably too restrictive for such goals, and
our current approach only exploits the word embeddings discovered during training.



5. Multi-Task Learning 






It is generally accepted that features trained for one task can be useful for related tasks. This 
idea was already exploited in the previous section when certain language model features, 
namely the word embeddings, were used to initialize the supervised networks. 

Multi-task learning (MTL) leverages this idea in a more systematic way. Models for
all tasks of interest are jointly trained with an additional linkage between their trainable
parameters in the hope of improving the generalization error. This linkage can take the form
of a regularization term in the joint cost function that biases the models towards common
representations. A much simpler approach consists in having the models share certain
parameters defined a priori. Multi-task learning has a long history in machine learning and
neural networks. Caruana (1997) gives a good overview of these past efforts.



5.1 Joint Decoding versus Joint Training 

Multitask approaches do not necessarily involve joint training. For instance, modern speech
recognition systems use Bayes' rule to combine the outputs of an acoustic model trained on
speech data and a language model trained on phonetic or textual corpora (Jelinek, 1976).
This joint decoding approach has been successfully applied to structurally more complex
NLP tasks. Sutton and McCallum (2005b) obtain improved results by combining the
predictions of independently trained CRF models using a joint decoding process at test
time that requires more sophisticated probabilistic inference techniques. On the other
hand, Sutton and McCallum (2005a) obtain results somewhat below the state-of-the-art
using joint decoding for SRL and syntactic parsing. Musillo and Merlo (2006) also describe
a negative result on the same joint task.

Joint decoding invariably works by considering additional probabilistic dependency 
paths between the models. Therefore it defines an implicit supermodel that describes 
all the tasks in the same probabilistic framework. Separately training a submodel only

Approach             POS      CHUNK    NER      SRL
                     (PWA)    (F1)     (F1)     (F1)
Benchmark Systems    97.24    94.29    89.31    77.92

Window Approach
NN+SLL+LM2           97.20    93.63    88.67    -
NN+SLL+LM2+MTL       97.22    94.10    88.62    -

Sentence Approach
NN+SLL+LM2           97.12    93.37    88.78    74.15
NN+SLL+LM2+MTL       97.22    93.75    88.27    74.29


Table 9: Effect of multi-tasking on our neural architectures. We trained POS, CHUNK and
NER in a MTL way, both for the window and sentence network approaches. SRL was only
included in the sentence approach joint training. As a baseline, we show previous results
of our window approach system, as well as additional results for our sentence approach
system, when trained separately on each task. Benchmark system performance is also given
for comparison.



makes sense when the training data blocks these additional dependency paths (in the sense
of d-separation; Pearl, 1988). This implies that, without joint training, the additional
dependency paths cannot directly involve unobserved variables. Therefore, the natural idea
of discovering common internal representations across tasks requires joint training.

Joint training is relatively straightforward when the training sets for the individual
tasks contain the same patterns with different labels. It is then sufficient to train a model
that computes multiple outputs for each pattern (Suddarth and Holden, 1991). Using
this scheme, Sutton et al. (2007) demonstrate improvements on POS tagging and
noun-phrase chunking using jointly trained CRFs. However the joint labeling requirement is a
limitation because such data is not often available. Miller et al. (2000) achieve performance
improvements by jointly training NER, parsing, and relation extraction in a statistical
parsing model. The joint labeling requirement problem was weakened using a predictor to
fill in the missing annotations.

Ando and Zhang (2005) propose a setup that works around the joint labeling
requirements. They define linear models of the form $f_i(x) = w_i^\top \Phi(x) + v_i^\top \Theta \Psi(x)$, where
$f_i$ is the classifier for the $i$-th task with parameters $w_i$ and $v_i$. Notations $\Phi(x)$ and $\Psi(x)$
represent engineered features for the pattern $x$. Matrix $\Theta$ maps the $\Psi(x)$ features into a low
dimensional subspace common across all tasks. Each task is trained using its own examples
without a joint labeling requirement. The learning procedure alternates the optimization
of $w_i$ and $v_i$ for each task, and the optimization of $\Theta$ to minimize the average loss for all
examples in all tasks. The authors also consider auxiliary unsupervised tasks for predicting
substructures. They report excellent results on several tasks, including POS and NER.

Figure 5: Example of multitasking with NN. Task 1 and Task 2 are two tasks trained with
the window approach architecture presented in Figure 1. Lookup tables as well as the first
hidden layer are shared. The last layer is task specific. The principle is the same with more
than two tasks.
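
As a rough illustration of the shared-subspace linear models of Ando and Zhang (2005) discussed in Section 5.1, the sketch below evaluates one task's score as a task-specific term plus a term computed through a projection matrix shared by all tasks. The function names, the plain-list representation, and the toy numbers are assumptions made for this example only; they are not taken from the original work.

```python
def dot(u, v):
    """Inner product of two equally sized lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def task_score(w, v, theta, phi_x, psi_x):
    """Score of one task: w . phi(x) + v . (theta psi(x)).

    w, v  : task-specific weight vectors (hypothetical names)
    theta : projection matrix shared across all tasks
    phi_x, psi_x : two engineered feature vectors for the same pattern x
    """
    return dot(w, phi_x) + dot(v, matvec(theta, psi_x))


# Toy usage: a 3-dimensional feature vector projected to a 1-dimensional shared subspace.
theta = [[1.0, 0.0, 1.0]]
score = task_score([1.0, 2.0], [3.0], theta, [1.0, 1.0], [2.0, 5.0, 4.0])
```

Only `theta` is common to all tasks in this sketch; each task keeps its own pair of weight vectors, which is why no jointly labeled data is needed.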



5.2 Multi-Task Benchmark Results 

Table 9 reports results obtained by jointly trained models for the POS, CHUNK, NER and
SRL tasks using the same setup as Section 4.5. We trained jointly POS, CHUNK and NER
using the window approach network. As we mentioned earlier, SRL can be trained only
with the sentence approach network, due to long-range dependencies related to the verb
predicate. We thus also trained all four tasks using the sentence approach network. In
both cases, all models share the lookup table parameters (2). The parameters of the first
linear layers were shared in the window approach case (see Figure 5), and the first
convolution layer parameters were shared in the sentence approach networks.

For the window approach, best results were obtained by enlarging the first hidden layer
size to $n^1_{hu} = 500$ (chosen by validation) in order to account for its shared responsibilities.
We used the same architecture as for SRL for the sentence approach network. The word
embedding dimension was kept constant at 50 in order to reuse the language models
of Section 4.5.

Training was achieved by minimizing the loss averaged across all tasks. This is easily
achieved with stochastic gradient by alternately picking examples for each task and
applying (17) to all the parameters of the corresponding model, including the shared
parameters. Note that this gives each task equal weight. Since each task uses the training
sets described in Table 1, it is worth noticing that examples can come from quite different

88.67
NN+SLL+LM2+CHUNK    74.72

Table 10: Comparison in generalization performance of benchmark NLP systems with
our neural networks (NNs) using increasing task-specific engineering. We report results
obtained with a network trained without the extra task-specific features (Section 5) and
with the extra task-specific features described in Section 6. The POS network was trained
with two character word suffixes; the NER network was trained using the small CoNLL
2003 gazetteer; the CHUNK and NER networks were trained with additional POS features;
and finally, the SRL network was trained with additional CHUNK features.



datasets. The generalization performance for each task was measured using the traditional
testing data specified in Table 1. Fortunately, none of the training and test sets overlap
across tasks.
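
The alternating training scheme described above can be sketched on a toy problem. This is a minimal illustration and not the networks used in our experiments: the model (a shared weight vector followed by a task-specific scale and bias), the squared loss, the round-robin task order, and all names are assumptions chosen to keep the example short.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_multitask(tasks, dim, steps=3000, lr=0.01, seed=0):
    """tasks maps a task name to a list of (x, y) pairs, with x a list of `dim` floats."""
    rng = random.Random(seed)
    shared = [rng.uniform(-0.1, 0.1) for _ in range(dim)]  # parameters shared by all tasks
    heads = {name: [1.0, 0.0] for name in tasks}           # task-specific (scale, bias)
    updates = {name: 0 for name in tasks}
    names = sorted(tasks)
    for step in range(steps):
        name = names[step % len(names)]   # alternate between tasks: equal weight per task
        x, y = rng.choice(tasks[name])    # one example drawn from that task's own data
        h = dot(shared, x)
        scale, bias = heads[name]
        err = (scale * h + bias) - y      # derivative of 0.5 * err**2 w.r.t. the output
        # Stochastic gradient step on the task-specific parameters ...
        heads[name] = [scale - lr * err * h, bias - lr * err]
        # ... and on the shared parameters, which every task gets to update.
        shared = [s - lr * err * scale * xi for s, xi in zip(shared, x)]
        updates[name] += 1
    return shared, heads, updates
```

Because tasks are visited in turn, each task contributes the same number of updates regardless of the size of its training set, which is one way to read "equal weight" in the paragraph above.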

While it is worth mentioning that MTL can produce a single unified architecture that
performs well for all these tasks, no (or only marginal) improvements were obtained with
this approach compared to training separate architectures per task (which still use
semi-supervised learning, which is in some sense the most important MTL task). The next section
shows we can leverage known correlations between tasks in a more direct manner.

6. The Temptation 



Results so far have been obtained by staying (almost^15) true to our from scratch philosophy.
We have so far avoided specializing our architecture for any task, disregarding a lot of useful 
a priori NLP knowledge. We have shown that, thanks to large unlabeled datasets, our 
generic neural networks can still achieve close to state-of-the-art performance by discovering 
useful features. This section explores what happens when we increase the level of task- 
specific engineering in our systems by incorporating some common techniques from the 
NLP literature. We often obtain further improvements. These figures are useful to quantify 
how far we went by leveraging large datasets instead of relying on a priori knowledge. 

6.1 Suffix Features

Word suffixes in many western languages are strong predictors of the syntactic function
of the word and therefore can benefit the POS system. For instance, Ratnaparkhi (1996)



15. We did some basic preprocessing of the raw input words as described in Section 3.4, hence the "almost"
in the title of this article. A completely from scratch approach would presumably not know anything 
about words at all and would work from letters only (or, taken to a further extreme, from speech or 
optical character recognition, as humans do). 






uses inputs representing word suffixes and prefixes up to four characters. We achieve this
in the POS task by adding discrete word features (Section 3.1.1) representing the last two
characters of every word. The size of the suffix dictionary was 455. This led to a small
improvement of the POS performance (Table 10, row NN+SLL+LM2+Suffix2). We also tried
suffixes obtained with the Porter (1980) stemmer and obtained the same performance as
when using two character suffixes.
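
A discrete suffix feature of this kind could be computed along the following lines. This is only a sketch: the lowercasing step, the reserved index 0 for unseen suffixes, and the function names are assumptions made for the example, not a description of our implementation.

```python
def build_suffix_dictionary(words, size=2):
    """Map every suffix of length `size` seen in `words` to a positive index."""
    suffixes = sorted({w.lower()[-size:] for w in words})
    return {suffix: i + 1 for i, suffix in enumerate(suffixes)}  # 0 is kept for unknown


def suffix_feature(word, dictionary, size=2):
    """Index of the word's suffix, or 0 if the suffix was never seen."""
    return dictionary.get(word.lower()[-size:], 0)


# Toy usage: the dictionary is built from training words, then applied to any word.
suffix_dict = build_suffix_dictionary(["running", "jumped", "cats"])
feature = suffix_feature("walked", suffix_dict)
```

The resulting index would then be fed to its own lookup table, like any other discrete word feature.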



6.2 Gazetteers 

State-of-the-art NER systems often use a large dictionary containing well known named
entities (e.g. Florian et al., 2003). We restricted ourselves to the gazetteer provided
by the CoNLL challenge, containing 8,000 locations, person names, organizations, and
miscellaneous entities. We trained a NER network with 4 additional word features indicating
(feature "on" or "off") whether the word is found in the gazetteer under one of these four
categories. The gazetteer includes not only words, but also chunks of words. If a sentence
chunk is found in the gazetteer, then all words in the chunk have their corresponding
gazetteer feature turned to "on". The resulting system displays a clear performance
improvement (Table 10, row NN+SLL+LM2+Gazetteer), slightly outperforming the baseline.
A plausible explanation of this large boost over the network using only the language model
is that gazetteers include word chunks, while we use only the word representation of our
language model. For example, "united" and "bicycle" seen separately are likely to be
non-entities, while "united bicycle" might be an entity, but catching it would require higher
level representations of our language model.
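
The chunk-matching rule described above can be sketched as follows. The category names, the case-insensitive matching, the maximum chunk length, and the toy gazetteer entries are assumptions made for illustration; the actual gazetteer format is not specified here.

```python
CATEGORIES = ("LOC", "PER", "ORG", "MISC")  # hypothetical names for the four categories


def gazetteer_features(tokens, gazetteer, max_len=5):
    """Return one {category: on/off} dictionary per token.

    `gazetteer` maps a category to a set of lowercase token tuples, so that
    both single words and multi-word chunks can be listed.
    """
    feats = [{c: False for c in CATEGORIES} for _ in tokens]
    lowered = [t.lower() for t in tokens]
    for cat in CATEGORIES:
        entries = gazetteer.get(cat, set())
        for start in range(len(tokens)):
            last = min(len(tokens), start + max_len)
            for end in range(start + 1, last + 1):
                if tuple(lowered[start:end]) in entries:
                    # Every word of a matched chunk gets the feature turned on.
                    for i in range(start, end):
                        feats[i][cat] = True
    return feats


toy_gazetteer = {"ORG": {("united", "bicycle")}, "LOC": {("paris",)}}
features = gazetteer_features(["United", "Bicycle", "opened", "in", "Paris"], toy_gazetteer)
```

In this toy run both "United" and "Bicycle" receive the organization feature because the two-word chunk is listed, even though neither word is listed on its own.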

6.3 Cascading

When one considers related tasks, it is reasonable to assume that tags obtained for one task
can be useful for taking decisions in the other tasks. Conventional NLP systems often use
features obtained from the output of other preexisting NLP systems. For instance, Shen
and Sarkar (2005) describe a chunking system that uses POS tags as input; Florian et al.
(2003) describe a NER system whose inputs include POS and CHUNK tags, as well as
the output of two other NER classifiers. State-of-the-art SRL systems exploit parse trees
(Gildea and Palmer, 2002; Punyakanok et al., 2005), related to CHUNK tags, and built
using POS tags (Charniak, 2000; Collins, 1999).



Table 10 reports results obtained for the CHUNK and NER tasks by adding discrete
word features (Section 3.1.1) representing the POS tags. In order to facilitate comparisons,
instead of using the more accurate tags from our POS network, we use for each task the
POS tags provided by the corresponding CoNLL challenge. We also report results obtained
for the SRL task by adding word features representing the CHUNK tags (also provided by
the CoNLL challenge). We consistently obtain moderate improvements.



6.4 Ensembles 

Constructing ensembles of classifiers is a proven way to trade computational efficiency for
generalization performance (Bell et al., 2007). Therefore it is not surprising that many
NLP systems achieve state-of-the-art performance by combining the outputs of multiple






Approach                             POS      CHUNK    NER
                                     (PWA)    (F1)     (F1)
Benchmark Systems                    97.24    94.29    89.31
NN+SLL+LM2+POS  worst                97.29    …        …
NN+SLL+LM2+POS  best                 97.35    …        89.86
NN+SLL+LM2+POS  voting ensemble      97.37    94.34    89.70
NN+SLL+LM2+POS  joined ensemble      97.30    94.35    89.67


Table 11: Comparison in generalization performance for POS, CHUNK and NER tasks of
the networks obtained by combining ten training runs with different initializations.



classifiers. For instance, Kudo and Matsumoto (2001) use an ensemble of classifiers trained
with different tagging conventions (see Section 3.2.3). Winning a challenge is of course a
legitimate objective. Yet it is often difficult to figure out which ideas are most responsible
for the state-of-the-art performance of a large ensemble.

Because neural networks are nonconvex, training runs with different initial parameters
usually give different solutions. Table 11 reports results obtained for the CHUNK and
NER tasks after ten training runs with random initial parameters. Voting the ten network
outputs on a per tag basis ("voting ensemble") leads to a small improvement over the average
network performance. We have also tried a more sophisticated ensemble approach: the ten
network output scores (before sentence-level likelihood) were combined with an additional
linear layer and then fed to a new sentence-level likelihood (13). The parameters of
the combining layers were then trained on the existing training set, while keeping the ten
networks fixed ("joined ensemble"). This approach did not improve on simple voting.
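
Per-tag voting can be sketched as a majority vote at each word position. The tie-breaking rule below (favouring the earliest network that proposed one of the most frequent tags) is an assumption made for the example, since the text does not specify how ties were resolved.

```python
from collections import Counter


def vote(predictions):
    """Combine tag sequences from several networks by per-position majority vote.

    `predictions` is a list of tag sequences, one per network, all of equal length.
    """
    voted = []
    for tags_at_position in zip(*predictions):
        counts = Counter(tags_at_position)
        best = max(counts.values())
        # Ties go to the earliest network that proposed a top-scoring tag.
        voted.append(next(t for t in tags_at_position if counts[t] == best))
    return voted


combined = vote([
    ["B-NP", "I-NP", "O"],
    ["B-NP", "O", "O"],
    ["B-VP", "I-NP", "O"],
])
```

Note that voting independently at each position does not guarantee a well-formed tag sequence, which is one motivation for the "joined ensemble" variant that re-applies a sentence-level likelihood.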

These ensembles come of course at the expense of a tenfold increase of the running
time. On the other hand, multiple training times could be improved using smart sampling
strategies (Neal, 1996).



We can also observe that the performance variability among the ten networks is not very
large. The local minima found by the training algorithm are usually good local minima,
thanks to the oversized parameter space and to the noise induced by the stochastic gradient
procedure (LeCun et al., 1998). In order to reduce the variance in our experimental results,
we always use the same initial parameters for networks trained on the same task (except of
course for the results reported in Table 11).

6.5 Parsing 



Gildea and Palmer (2002) offer several arguments suggesting that syntactic parsing is a
necessary prerequisite for the SRL task. The CoNLL 2005 SRL benchmark task provides
parse trees computed using both the Charniak (2000) and Collins (1999) parsers.
State-of-the-art systems often exploit additional parse trees such as the k top ranking parse trees
(Koomen et al., 2005; Haghighi et al., 2005).



In contrast, our SRL networks so far do not use parse trees at all. They rely instead
on internal representations transferred from a language model trained with an objective

Figure 6: Charniak parse tree for the sentence "The luxury auto maker last year sold 1,214
cars in the U.S.". Level 0 is the original tree. Levels 1 to 4 are obtained by successively
collapsing terminal tree branches. For each level, words receive tags describing the segment
associated with the corresponding leaf. All words receive tag "O" at level 3 in this example.



function that captures a lot of syntactic information (see Section 4.6). It is therefore 
legitimate to question whether this approach is an acceptable lightweight replacement for 
parse trees. 

We answer this question by providing parse tree information as additional input features
to our system. We have limited ourselves to the Charniak parse tree provided with the
CoNLL 2005 data. Considering that a node in a syntactic parse tree assigns a label
to a segment of the parsed sentence, we propose a way to feed (partially) this labeled
segmentation to our network, through additional lookup tables. Each of these lookup
tables encodes labeled segments of each parse tree level (up to a certain depth). The labeled
segments are fed to the network following an IOBES tagging scheme (see Sections 3.2.3
and 3.1.1). As there are 40 different phrase labels in WSJ, each additional tree-related
lookup table has 161 entries (40 × 4 + 1) corresponding to the IBES segment tags, plus the
extra O tag.

We call level 0 the information associated with the leaves of the original Charniak parse
tree. The lookup table for level 0 encodes the corresponding IOBES phrase tags for each
word. We obtain levels 1 to 4 by repeatedly trimming the leaves as shown in Figure 6. We
labeled "O" words belonging to the root node "S", or all words of the sentence if the root
itself has been trimmed.
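
To make the encoding concrete, the sketch below converts the labeled segments of one parse tree level into IOBES tags and computes the size of the corresponding lookup table. The segment representation as (label, start, end) triples with an exclusive end index and the function names are assumptions made for this example. The toy input is the leaf segmentation of the sentence quoted in footnote 16.

```python
def iobes_tags(n_words, segments):
    """Tag each word with the segment it belongs to, using the IOBES scheme.

    `segments` is a list of (label, start, end) triples, `end` being exclusive.
    Words outside every segment keep the tag "O".
    """
    tags = ["O"] * n_words
    for label, start, end in segments:
        if end - start == 1:
            tags[start] = "S-" + label          # single-word segment
        else:
            tags[start] = "B-" + label          # beginning
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label          # inside
            tags[end - 1] = "E-" + label        # end
    return tags


def lookup_table_size(n_labels):
    """Four segment tags (I, B, E, S) per phrase label, plus the single O tag."""
    return 4 * n_labels + 1


# "They are starting to buy growth stocks": six leaf segments over seven words.
leaves = [("NP", 0, 1), ("VP", 1, 2), ("VP", 2, 3), ("VP", 3, 4), ("VP", 4, 5), ("NP", 5, 7)]
level0 = iobes_tags(7, leaves)
```

With 40 phrase labels, `lookup_table_size(40)` gives the 161 entries mentioned above.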

Experiments were performed using the LM2 language model, with the same network
architectures (see Table 5) and with additional lookup tables of dimension 5 for each
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76.05 
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75.89 
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74.72 
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75.49 



Table 12: Generalization performance on the SRL task of our NN architecture compared
with the benchmark system. We show performance of our system fed with different levels
of depth of the Charniak parse tree. We report previous results of our architecture with no
parse tree as a baseline. Koomen et al. (2005) report test and validation performance using
six parse trees, as well as validation performance using only the top Charniak parse tree.
For comparison purposes, we hence also report validation performance. Finally, we report
our performance with the CHUNK feature, and compare it against a level 0 feature PT0
obtained by our network.






parse tree level. Table 12 reports the performance improvements obtained by providing
increasing levels of parse tree information. Level 0 alone increases the F1 score by almost
1.5%. Additional levels yield diminishing returns. The top performance reaches 76.06% F1
score. This is not too far from the state-of-the-art system which we note uses six parse
trees instead of one. Koomen et al. (2005) also report a 74.76% F1 score on the validation
set using only the Charniak parse tree. Using the first three parse tree levels, we reach this
performance on the validation set.



We also report in Table 12 our previous performance obtained with the CHUNK
feature (see Table 10). It is surprising to observe that adding chunking features into the
semantic role labeling network performs significantly worse than adding features describing
level 0 of the Charniak parse tree (Table 12). Indeed, if we ignore the label prefixes
"BIES" defining the segmentation, the parse tree leaves (at level 0) and the chunking
have identical labeling. However, the parse trees identify leaf sentence segments that are
often smaller than those identified by the chunking tags, as shown by Hollingshead et al.
(2005).^16 Instead of relying on the Charniak parser, we chose to train a second chunking
network to identify the segments delimited by the leaves of the Penn Treebank parse trees
(level 0). Our network achieved 92.25% F1 score on this task (we call it PT0), while we



16. As in (Hollingshead et al., 2005), consider the sentence and chunk labels "(NP They) (VP are starting
to buy) (NP growth stocks)". The parse tree can be written as "(S (NP They) (VP are (VP starting (S
(VP to (VP buy (NP growth stocks)))))))". The tree leaves segmentation is thus given by "(NP They)
(VP are) (VP starting) (VP to) (VP buy) (NP growth stocks)".

evaluated Charniak performance as 91.94% on the same task. As shown in Table 12, feeding
our own PT0 prediction into the SRL system gives similar performance to using Charniak
predictions, and is consistently better than the CHUNK feature.



6.6 Word Representations 

In Section 4, we adapted our neural network architecture for training a language model task.
By leveraging a large amount of unlabeled text data, we induced word embeddings which
were shown to boost generalization performance on all tasks. While we chose to stick with
one single architecture, other ways to induce word representations exist. Mnih and Hinton
(2007) proposed a related language model approach inspired from Restricted Boltzmann
Machines. However, word representations are perhaps more commonly inferred from n-gram
language modelling rather than smoothed language models. One popular approach is the
Brown clustering algorithm (Brown et al., 1992a), which builds hierarchical word clusters
by maximizing the bigram's mutual information. The induced word representation has
been used with success in a wide variety of NLP tasks, including POS (Schütze, 1995),
NER (Miller et al., 2004; Ratinov and Roth, 2009), or parsing (Koo et al., 2008). Other
related approaches exist, like phrase clustering (Lin and Wu, 2009) which has been shown
to work well for NER. Finally, Huang and Yates (2009) have recently proposed a smoothed
language modelling approach based on a Hidden Markov Model, with success on POS and
Chunking tasks.

While a comparison of all these word representations is beyond the scope of this paper, 
it is rather fair to question the quality of our word embeddings compared to a popular NLP 
approach. In this section, we report a comparison of our word embeddings against Brown 
clusters, when used as features into our neural network architecture. We report as baseline 
previous results where our word embeddings are fine-tuned for each task. We also report 
performance when our embeddings are kept fixed during task-specific training. Since convex 
machine learning algorithms are common practice in NLP, we finally report performances 
for the convex version of our architecture. 

For the convex experiments, we considered the linear version of our neural networks
(instead of having several linear layers interleaved with a non-linearity). While we always
picked the sentence approach for SRL, we had to consider the window approach in this
particular convex setup, as the sentence approach network (see Figure 2) includes a Max
layer. Having only one linear layer in our neural network is not enough to make our
architecture convex: all lookup-tables (for each discrete feature) must also be fixed. The
word-lookup table is simply fixed to the embeddings obtained from our language model
LM2. All other discrete feature lookup-tables (caps, POS, Brown Clusters...) are fixed to a
standard sparse representation. Using the notation introduced in Section 3.1.1, if $LT_{W^k}$ is
the lookup-table of the $k$-th discrete feature, we have $W^k \in \mathbb{R}^{|\mathcal{D}^k| \times |\mathcal{D}^k|}$ and the representation
of the discrete input $w$ is obtained with:

$$LT_{W^k}(w) = \langle W^k \rangle^1_w = \Big( 0, \cdots, 0, \underbrace{1}_{\text{at index } w}, 0, \cdots, 0 \Big)^{\top} . \qquad (19)$$
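
Put differently, fixing a lookup table to the standard sparse representation amounts to using an identity matrix, so that looking up a feature value returns a one-hot column. The short sketch below illustrates this with plain lists; the zero-based indexing and the function names are assumptions made for the example.

```python
def one_hot(index, dictionary_size):
    """Sparse representation: all zeros except a single 1 at position `index`."""
    column = [0.0] * dictionary_size
    column[index] = 1.0
    return column


def fixed_lookup_table(dictionary_size):
    """A square lookup table fixed to the identity matrix (list of rows)."""
    return [one_hot(i, dictionary_size) for i in range(dictionary_size)]


def lookup(table, index):
    """Return column `index` of the table, i.e. the representation of that feature value."""
    return [row[index] for row in table]


table = fixed_lookup_table(4)
representation = lookup(table, 2)
```

Since nothing in such a table is trainable, the only remaining parameters are those of the single linear layer, which keeps the training objective convex.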



Training our architecture in this convex setup with the sentence-level likelihood (13)
corresponds to training a CRF. In that respect, these convex experiments show the
performance of our word embeddings in a classical NLP framework.

Approach                                       POS      CHUNK    NER      SRL
                                               (PWA)    (F1)     (F1)     (F1)

Non-convex Approach
LM2 (non-linear NN)                            97.29    94.32    89.59    76.06
LM2 (non-linear NN, fixed embeddings)          97.10    94.45    88.79    72.24
Brown Clusters (non-linear NN, 130K words)     96.92    94.35    87.15    72.09
Brown Clusters (non-linear NN, all words)      96.81    94.21    86.68    71.44

Convex Approach
LM2 (linear NN, fixed embeddings)              96.69    93.51    86.64    59.11
Brown Clusters (linear NN, 130K words)         96.56    94.20    86.46    51.54
Brown Clusters (linear NN, all words)          96.28    94.22    86.63    56.42


Table 13: Generalization performance of our neural network architecture trained with 
our language model (LM2) word embeddings, and with the word representations derived 
from the Brown Clusters. As before, all networks are fed with a capitalization feature. 
Additionally, POS is using a word suffix of size 2 feature, CHUNK is fed with POS, NER 
uses the CoNLL 2003 gazetteer, and SRL is fed with levels 1-5 of the Charniak parse tree, 
as well as a verb position feature. We report performance with both convex and non-convex 
architectures (300 hidden units for all tasks, with an additional 500 hidden units layer for 
SRL). We also provide results for Brown Clusters induced with a 130K word dictionary, as 
well as Brown Clusters induced with all words of the given tasks. 



Following the Ratinov and Roth (2009) and Koo et al. (2008) setups, we generated 1,000
Brown clusters using the implementation^17 from Liang (2005). To make the comparison
fair, the clusters were first induced on the concatenation of Wikipedia and Reuters datasets,
as we did in Section 4 for training our largest language model LM2, using a 130K word
dictionary. This dictionary covers about 99% of the words in the training set of each task.
To cover the last 1%, we augmented the dictionary with the missing words (reaching about
140K words) and induced Brown Clusters using the concatenation of WSJ, Wikipedia, and
Reuters.

The Brown clustering approach is hierarchical and generates a binary tree of clusters. 
Each word in the vocabulary is assigned to a node in the tree. Features are extracted from 
this tree by considering the path from the root to the node containing the word of interest. 
Following Ratinov & Roth, we picked as features the path prefixes of size 4, 6, 10 and 20. In 
the non-convex experiments, we fed these four Brown Cluster features to our architecture 
using four different lookup tables, replacing our word lookup table. The size of the lookup 
tables was chosen to be 12 by validation. In the convex case, we used the classical sparse
representation (19), as for any other discrete feature.
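
The path-prefix features can be illustrated as follows, assuming that each word's position in the cluster tree is given as a string of 0/1 branching decisions from the root (a common output format, but an assumption here). When a path is shorter than a requested prefix length, the sketch simply returns the whole path.

```python
PREFIX_LENGTHS = (4, 6, 10, 20)


def brown_prefix_features(path, lengths=PREFIX_LENGTHS):
    """Return one discrete feature per prefix length of the root-to-node bit string."""
    return [path[:n] for n in lengths]


# Hypothetical 10-bit path for some word in the cluster tree.
features = brown_prefix_features("0110100111")
```

Each of the four resulting strings would be mapped to an index in its own dictionary and fed to its own lookup table. Short prefixes group many words into coarse clusters, while long prefixes identify fine-grained ones.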



We first report in Table 13 generalization performance of our best non-convex networks
trained with our LM2 language model and with Brown Cluster features. Our embeddings
perform at least as well as Brown Clusters. Results are more mixed in a convex setup.
For most tasks, going non-convex is better for both word representation types. In general,



17. Available at http://www.eecs.berkeley.edu/~pliang/software

Task     Features
POS      Suffix of size 2
CHUNK    POS
NER      CoNLL 2003 gazetteer
PT0      POS
SRL      PT0, verb position



Table 14: Features used by the SENNA implementation, for each task. In addition, all tasks
use "low caps word" and "caps" features.



"fine-tuning" our embeddings for each task also gives an extra boost. Finally, using a better 



word coverage with Brown Clusters ("all words" instead of "130K words" in Table 13) did 
not help. 

More complex features could possibly be combined instead of using a non-linear
model. For instance, Turian et al. (2010) performed a comparison of Brown Clusters and
embeddings trained in the same spirit as ours^18 with additional features combining labels
and tokens. We believe this type of comparison should be taken with care, as combining
a given feature with different word representations might not have the same effect on each
word representation.



6.7 Engineering a Sweet Spot

We implemented a standalone version of our architecture, written in the C language.
We gave the name "SENNA" (Semantic/syntactic Extraction using a Neural Network
Architecture) to the resulting system. The parameters of each architecture are the ones
described in Table 5. All the networks were trained separately on each task using the
sentence-level likelihood (SLL). The word embeddings were initialized to LM2 embeddings,
and then fine-tuned for each task. We summarize features used by our implementation
in Table 14, and we report performance achieved on each task in Table 15. The runtime
version^19 contains about 2500 lines of C code, runs in less than 150MB of memory, and needs
less than a millisecond per word to compute all the tags. Table 16 compares the tagging
speeds for our system and for the few available state-of-the-art systems: the Toutanova et al.
(2003) POS tagger^20, the Shen et al. (2007) POS tagger^21 and the Koomen et al. (2005) SRL
system.^22 All programs were run on a single 3GHz Intel core. The POS taggers were run
with Sun Java 1.6 with a large enough memory allocation to reach their top tagging speed.



18. However they did not reach our embedding performance. There are several differences in how they
trained their models that might explain this. Firstly, they may have experienced difficulties because
they train 50-dimensional embeddings for 269K distinct words using a comparatively small training set
(RCV1, 37M words), unlikely to contain enough instances of the rare words. Secondly, they predict the
correctness of the final word of each window instead of the center word (Turian et al., 2010), effectively
restricting the model to unidirectional prediction. Finally, they do not fine tune their embeddings after
unsupervised training.

19. Available at http://ml.nec-labs.com/senna

20. Available at http://nlp.stanford.edu/software/tagger.shtml. We picked the 3.0 version (May 2010).

21. Available at http://www.cis.upenn.edu/~xtag/spinal

22. Available at http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=SRL
Task                                          Benchmark    SENNA
Part of Speech (POS)             (Accuracy)   97.24 %      97.29 %
Chunking (CHUNK)                 (F1)         94.29 %      94.32 %
Named Entity Recognition (NER)   (F1)         89.31 %      89.59 %
Parse Tree level 0 (PT0)         (F1)         91.94 %      92.25 %
Semantic Role Labeling (SRL)     (F1)         77.92 %      75.49 %

Table 15: Performance of the engineered sweet spot (SENNA) on various tagging tasks. The
PT0 task replicates the sentence segmentation of the parse tree leaves. The corresponding
benchmark score measures the quality of the Charniak parse tree leaves relative to the Penn
Treebank gold parse trees.



POS System                 RAM (MB)   Time (s)
Toutanova et al. (2003)    800        64
Shen et al. (2007)         2200       833
SENNA                      32         4

SRL System                 RAM (MB)   Time (s)
Koomen et al. (2005)       3400       6253
SENNA                      124        51

Table 16: Runtime speed and memory consumption comparison between state-of-the-art
systems and our approach (SENNA). We give the runtime in seconds for running both
the POS and SRL taggers on their respective testing sets. Memory usage is reported in
megabytes.



The beam size of the Shen tagger was set to 3 as recommended in the paper. Regardless 
of implementation differences, it is clear that our neural networks run considerably faster. 
They also require much less memory. Our POS and SRL taggers run in 32MB and 120MB
of RAM respectively. The Shen and Toutanova taggers slow down significantly when the 
Java machine is given less than 2.2GB and 800MB of RAM respectively, while the Koomen 
tagger requires at least 3GB of RAM. 
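To make the comparison concrete, the short sketch below computes the runtime and memory ratios implied by Table 16. The numbers are copied from the table; the dictionary layout and the helper name `ratios_vs_senna` are illustrative choices and not part of SENNA.

```python
# Runtime (seconds) and memory (MB) figures copied from Table 16.
table16 = {
    "POS": {
        "Toutanova et al. (2003)": {"ram_mb": 800, "time_s": 64},
        "Shen et al. (2007)": {"ram_mb": 2200, "time_s": 833},
        "SENNA": {"ram_mb": 32, "time_s": 4},
    },
    "SRL": {
        "Koomen et al. (2005)": {"ram_mb": 3400, "time_s": 6253},
        "SENNA": {"ram_mb": 124, "time_s": 51},
    },
}


def ratios_vs_senna(task):
    """Return {system: (time ratio, RAM ratio)} relative to SENNA for one task."""
    senna = table16[task]["SENNA"]
    return {
        name: (row["time_s"] / senna["time_s"], row["ram_mb"] / senna["ram_mb"])
        for name, row in table16[task].items()
        if name != "SENNA"
    }


for task in table16:
    for name, (time_ratio, ram_ratio) in ratios_vs_senna(task).items():
        print(f"{task}: {name} takes {time_ratio:.1f}x the time "
              f"and {ram_ratio:.1f}x the RAM of SENNA")
```

These ratios only restate the table; they say nothing about per-word latency, since the test set sizes are not repeated here.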

We believe that a number of reasons explain the speed advantage of our system. First, 
our system only uses rather simple input features and therefore avoids the nonnegligible 
computation time associated with complex handcrafted features. Secondly, most network 
computations are dense matrix-vector operations. In contrast, systems that rely on a great 
number of sparse features experience memory latencies when traversing the sparse data 
structures. Finally, our compact implementation is self-contained. Since it does not rely on 
the outputs of disparate NLP systems, it does not suffer from communication latency issues.

7. Critical Discussion 

Although we believe that this contribution represents a step towards the "NLP from scratch" 
objective, we are keenly aware that both our goal and our means can be criticized. 
The main criticism of our goal can be summarized as follows. Over the years, the NLP
community has developed a considerable expertise in engineering effective NLP features. 
Why should they forget this painfully acquired expertise and instead painfully acquire 
the skills required to train large neural networks? As mentioned in our introduction, we 
observe that no single NLP task really covers the goals of NLP. Therefore we believe that 
task-specific engineering (i.e. that does not generalize to other tasks) is not desirable. But 
we also recognize how much our neural networks owe to previous NLP task-specific research. 

The main criticism of our means is easier to address. Why did we choose to rely on a 
twenty-year-old technology, namely multilayer neural networks? We were simply attracted
by their ability to discover hidden representations using a stochastic learning algorithm
that scales linearly with the number of examples. Most of the neural network technology
necessary for our work was described ten years ago (e.g. Le Cun et al., 1998). However,
if we had decided ten years ago to train the language model network LM2 using a vintage 
computer, training would only be nearing completion today. Training algorithms that scale 
linearly are most able to benefit from such tremendous progress in computer hardware. 



8. Conclusion 

We have presented a multilayer neural network architecture that can handle a number of 
NLP tasks with both speed and accuracy. The design of this system was determined by 
our desire to avoid task-specific engineering as much as possible. Instead we rely on large 
unlabeled datasets and let the training algorithm discover internal representations that 
prove useful for all the tasks of interest. Using this strong basis, we have engineered a fast 
and efficient "all purpose" NLP tagger that we hope will prove useful to the community. 
Appendix A. Neural Network Gradients

We consider a neural network $f_\theta(\cdot)$, with parameters $\theta$. We maximize the likelihood (8), or
minimize the ranking criterion (18), with respect to the parameters $\theta$, using stochastic gradient.
By negating the likelihood, we now assume that in all cases we minimize a cost $C(f_\theta(\cdot))$
with respect to $\theta$.

Following the classical "back-propagation" derivations (LeCun, 1985; Rumelhart et al.,
1986) and the modular approach shown in (Bottou, 1991), any feed-forward neural network
with $L$ layers, like the ones shown in Figure 1 and Figure 2, can be seen as a composition
of functions $f_\theta^l(\cdot)$ corresponding to each layer $l$:

$$f_\theta(\cdot) = f_\theta^L(f_\theta^{L-1}(\ldots f_\theta^1(\cdot) \ldots))$$

Partitioning the parameters of the network with respect to each layer $1 \le l \le L$, we write:

$$\theta = (\theta^1, \ldots, \theta^l, \ldots, \theta^L).$$



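We are now interested in computing the gradients of the cost with respect to each $\theta^l$.
Applying the chain rule (generalized to vectors) we obtain the classical backpropagation
recursion:

$$\frac{\partial C}{\partial \theta^l} = \frac{\partial f_\theta^l}{\partial \theta^l} \, \frac{\partial C}{\partial f_\theta^l} \qquad (20)$$

$$\frac{\partial C}{\partial f_\theta^{l-1}} = \frac{\partial f_\theta^l}{\partial f_\theta^{l-1}} \, \frac{\partial C}{\partial f_\theta^l} \qquad (21)$$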



In other words, we first initialize the recursion by computing the gradient of the cost with
respect to the last layer output $\partial C / \partial f_\theta^L$. Then each layer $l$ computes the gradient with
respect to its own parameters with (20), given the gradient coming from its output
$\partial C / \partial f_\theta^l$. To perform the backpropagation, it also computes the gradient with respect to
its own inputs, as shown in (21). We now derive the gradients for each layer we used in this paper.

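Lookup Table Layer Given a matrix of parameters $\theta^1 = W^1$ and word (or discrete
feature) indices $[w]_1^T$, the layer outputs the matrix:

$$f_\theta^1([w]_1^T) = \left( \langle W \rangle^1_{[w]_1} \;\; \langle W \rangle^1_{[w]_2} \;\; \ldots \;\; \langle W \rangle^1_{[w]_T} \right).$$

The gradients of the weights $\langle W \rangle^1_i$ are given by:

$$\frac{\partial C}{\partial \langle W \rangle^1_i} = \sum_{\{1 \le t \le T \,/\, [w]_t = i\}} \left\langle \frac{\partial C}{\partial f_\theta^1} \right\rangle^1_t$$

This sum equals zero if the index $i$ in the lookup table does not correspond to a word in
the sequence. In this case, the $i^{th}$ column of $W$ does not need to be updated. As a Lookup
Table Layer is always the first layer, we do not need to compute its gradients with respect
to the inputs.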
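The sparse nature of this gradient can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: matrices are stored as lists of columns, and the function names `lookup_forward` and `lookup_backward` are hypothetical, not taken from the SENNA code.

```python
def lookup_forward(W_cols, indices):
    """Return the layer output as a list of columns, one per word index."""
    return [list(W_cols[w]) for w in indices]


def lookup_backward(indices, grad_out_cols):
    """Accumulate dC/dW column by column.

    grad_out_cols[t] is the gradient of the cost with respect to output column t.
    Only columns whose index occurs in `indices` receive a gradient, so the
    result is stored sparsely as {word index: gradient column}.
    """
    grad_W = {}
    for t, w in enumerate(indices):
        col = grad_W.setdefault(w, [0.0] * len(grad_out_cols[t]))
        for k, g in enumerate(grad_out_cols[t]):
            col[k] += g  # a word seen several times sums its contributions
    return grad_W


# Usage: word 2 appears twice, word 1 never appears and gets no update.
W_cols = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grads = lookup_backward([2, 0, 2], [[1.0, 1.0], [0.5, -1.0], [2.0, 3.0]])
print(grads)
```

Storing the gradient as a dictionary mirrors the remark above that untouched columns need no update.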
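Linear Layer Given parameters $\theta^l = (W^l, b^l)$ and an input vector $f_\theta^{l-1}$, the output is
given by:

$$f_\theta^l = W^l f_\theta^{l-1} + b^l. \qquad (22)$$

The gradients with respect to the parameters are then obtained with:

$$\frac{\partial C}{\partial W^l} = \left[ \frac{\partial C}{\partial f_\theta^l} \right] \left[ f_\theta^{l-1} \right]^{\mathrm{T}} \quad \text{and} \quad \frac{\partial C}{\partial b^l} = \frac{\partial C}{\partial f_\theta^l}, \qquad (23)$$

and the gradients with respect to the inputs are computed with:

$$\frac{\partial C}{\partial f_\theta^{l-1}} = \left[ W^l \right]^{\mathrm{T}} \frac{\partial C}{\partial f_\theta^l}. \qquad (24)$$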
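As an illustration of (22), (23) and (24), here is a minimal pure-Python sketch. The list-of-rows representation and the function names are assumptions made for readability; a practical implementation would use dense matrix-vector routines.

```python
def linear_forward(W, b, x):
    """Equation (22): f = W x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def linear_backward(W, x, grad_out):
    """Equations (23) and (24), given grad_out = dC/df for this layer."""
    # (23): dC/dW is the outer product of the output gradient and the input.
    grad_W = [[g_i * x_j for x_j in x] for g_i in grad_out]
    # (23): dC/db is the output gradient itself.
    grad_b = list(grad_out)
    # (24): dC/dx is W transposed times the output gradient.
    grad_x = [sum(W[i][j] * grad_out[i] for i in range(len(W)))
              for j in range(len(x))]
    return grad_W, grad_b, grad_x


# Usage on a 3x2 layer.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [0.5, -0.5, 0.0]
x = [1.0, -1.0]
out = linear_forward(W, b, x)
grad_W, grad_b, grad_x = linear_backward(W, x, [1.0, 0.0, -2.0])
```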



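Convolution Layer Given an input matrix $f_\theta^{l-1}$, a Convolution Layer $f_\theta^l(\cdot)$ applies a
Linear Layer operation (22) successively on each window $\langle f_\theta^{l-1} \rangle_t^{d_{win}}$ ($1 \le t \le T$) of size
$d_{win}$. Using (23), the gradients of the parameters are thus given by summing over all
windows:

$$\frac{\partial C}{\partial W^l} = \sum_{t=1}^{T} \left[ \left\langle \frac{\partial C}{\partial f_\theta^l} \right\rangle^1_t \right] \left[ \langle f_\theta^{l-1} \rangle_t^{d_{win}} \right]^{\mathrm{T}} \quad \text{and} \quad \frac{\partial C}{\partial b^l} = \sum_{t=1}^{T} \left\langle \frac{\partial C}{\partial f_\theta^l} \right\rangle^1_t.$$

After initializing the input gradients $\partial C / \partial f_\theta^{l-1}$ to zero, we iterate (24) over all windows for
$1 \le t \le T$, leading to the accumulation:^23

$$\left\langle \frac{\partial C}{\partial f_\theta^{l-1}} \right\rangle_t^{d_{win}} \mathrel{+}= \left[ W^l \right]^{\mathrm{T}} \left\langle \frac{\partial C}{\partial f_\theta^l} \right\rangle^1_t.$$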
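The accumulation over overlapping windows can be sketched as follows. This is a simplified illustration, not the SENNA implementation: it assumes the input is already padded, that window t is the concatenation of the d_win consecutive input columns starting at column t, and the name `conv_backward` is hypothetical.

```python
def conv_backward(W, inputs, d_win, grad_out):
    """Backward pass of a convolution layer built from a shared linear map.

    W        : weights as a list of rows, each of length d_win * d_in
    inputs   : list of input columns (each of length d_in), assumed padded
    grad_out : list of output-gradient columns, one per window
    """
    d_in = len(inputs[0])
    n_out = len(W)
    grad_W = [[0.0] * (d_win * d_in) for _ in range(n_out)]
    grad_b = [0.0] * n_out
    grad_in = [[0.0] * d_in for _ in inputs]  # initialized to zero
    for t, g in enumerate(grad_out):
        # Window t: concatenation of d_win consecutive input columns.
        window = [v for col in inputs[t:t + d_win] for v in col]
        # (23) summed over windows.
        for i in range(n_out):
            grad_b[i] += g[i]
            for j, x_j in enumerate(window):
                grad_W[i][j] += g[i] * x_j
        # (24) for this window, accumulated with "+=" because windows overlap.
        for j in range(d_win * d_in):
            back = sum(W[i][j] * g[i] for i in range(n_out))
            grad_in[t + j // d_in][j % d_in] += back
    return grad_W, grad_b, grad_in


# Usage: one output unit, scalar inputs, windows of size 2.
grad_W, grad_b, grad_in = conv_backward(
    [[1.0, -1.0]], [[1.0], [2.0], [3.0]], 2, [[1.0], [2.0]])
```

The middle input column belongs to both windows, which is why its gradient is the sum of two contributions.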



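Max Layer Given a matrix $f_\theta^{l-1}$, the Max Layer computes

$$\left[ f_\theta^l \right]_i = \max_t \left[ \langle f_\theta^{l-1} \rangle^1_t \right]_i \quad \text{and} \quad a_i = \operatorname*{argmax}_t \left[ \langle f_\theta^{l-1} \rangle^1_t \right]_i \quad \forall i,$$

where $a_i$ stores the index of the largest value. We only need to compute the gradient with
respect to the inputs, as this layer has no parameters. The gradient is given by

$$\left[ \left\langle \frac{\partial C}{\partial f_\theta^{l-1}} \right\rangle^1_t \right]_i = \begin{cases} \left[ \dfrac{\partial C}{\partial f_\theta^l} \right]_i & \text{if } t = a_i \\ 0 & \text{otherwise.} \end{cases}$$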
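A small sketch of the Max Layer and its gradient routing is given below. The matrix is stored as a list of T columns; the function names are illustrative, and ties are broken in favor of the earliest column, which is an arbitrary choice of this sketch.

```python
def max_forward(cols):
    """Return (per-row maximum over columns, argmax column index per row)."""
    n_rows = len(cols[0])
    out, arg = [], []
    for i in range(n_rows):
        a_i = max(range(len(cols)), key=lambda t: cols[t][i])
        arg.append(a_i)
        out.append(cols[a_i][i])
    return out, arg


def max_backward(arg, grad_out, n_cols):
    """Send each output gradient to the column that achieved the maximum."""
    grad_in = [[0.0] * len(arg) for _ in range(n_cols)]
    for i, a_i in enumerate(arg):
        grad_in[a_i][i] = grad_out[i]
    return grad_in


# Usage: three columns (T = 3) with two rows each.
cols = [[1.0, 5.0], [3.0, 2.0], [2.0, 4.0]]
out, arg = max_forward(cols)
grad_in = max_backward(arg, [0.1, 0.2], len(cols))
```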



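HardTanh Layer Given a vector $f_\theta^{l-1}$, and the definition of the HardTanh (5) we get

$$\left[ \frac{\partial C}{\partial f_\theta^{l-1}} \right]_i = \begin{cases} 0 & \text{if } \left[ f_\theta^{l-1} \right]_i < -1 \\ \left[ \dfrac{\partial C}{\partial f_\theta^l} \right]_i & \text{if } -1 \le \left[ f_\theta^{l-1} \right]_i \le 1 \\ 0 & \text{if } \left[ f_\theta^{l-1} \right]_i > 1 \end{cases}$$

if we ignore non-differentiability points.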
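In code, the HardTanh gradient is a simple mask on the incoming gradient. The sketch below uses hypothetical function names and treats the boundary points -1 and 1 as part of the linear region, matching the case analysis above.

```python
def hardtanh(x):
    """Clip each component to the interval [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in x]


def hardtanh_backward(x, grad_out):
    """Pass the gradient where the input lies in [-1, 1], block it elsewhere."""
    return [g if -1.0 <= v <= 1.0 else 0.0 for v, g in zip(x, grad_out)]


# Usage: the first and last components are saturated.
x = [-2.0, -0.5, 0.3, 1.5]
y = hardtanh(x)
grad_x = hardtanh_backward(x, [1.0, 2.0, 3.0, 4.0])
```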



23. We denote "+=" any accumulation operation. 
Word-Level Log-Likelihood The network outputs a score $[f_\theta]_i$ for each tag indexed by
$i$. Following (11), if $y$ is the true tag for a given example, the stochastic score to minimize
can be written as

$$C(f_\theta) = \operatorname*{logadd}_j \, [f_\theta]_j - [f_\theta]_y.$$

Considering the definition of the logadd (10), the gradient with respect to $f_\theta$ is given by

$$\frac{\partial C}{\partial [f_\theta]_i} = \frac{e^{[f_\theta]_i}}{\sum_k e^{[f_\theta]_k}} - 1_{i=y} \quad \forall i.$$
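The word-level gradient is the softmax of the scores minus an indicator on the true tag. The sketch below, with illustrative function names, computes the cost and gradient and checks one component against a finite difference. The max-shift inside `logadd` is a standard numerical-stability device that we add here; it does not change the value.

```python
import math


def logadd(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def word_level_cost(scores, y):
    """C(f) = logadd_j f_j - f_y for true tag index y."""
    return logadd(scores) - scores[y]


def word_level_grad(scores, y):
    """dC/df_i = softmax(f)_i - 1 if i == y else softmax(f)_i."""
    z = logadd(scores)
    return [math.exp(s - z) - (1.0 if i == y else 0.0)
            for i, s in enumerate(scores)]


# Usage: compare the analytic gradient with a finite difference.
scores, y, eps = [1.0, 2.0, 0.5], 1, 1e-6
grad = word_level_grad(scores, y)
bumped = [scores[0] + eps] + scores[1:]
finite_diff = (word_level_cost(bumped, y) - word_level_cost(scores, y)) / eps
print(grad[0], finite_diff)
```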



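Sentence-Level Log-Likelihood The network outputs a matrix where each element
$[f_\theta]_{i,t}$ gives a score for tag $i$ at word $t$. Given a tag sequence $[y]_1^T$ and an input sequence $[x]_1^T$,
we maximize the likelihood (13), which corresponds to minimizing the score

$$C(f_\theta, A) = \underbrace{\operatorname*{logadd}_{\forall [j]_1^T} \, s([x]_1^T, [j]_1^T, \tilde{\theta})}_{C_{logadd}} - s([x]_1^T, [y]_1^T, \tilde{\theta}),$$

with

$$s([x]_1^T, [y]_1^T, \tilde{\theta}) = \sum_{t=1}^{T} \left( [A]_{[y]_{t-1},[y]_t} + [f_\theta]_{[y]_t,t} \right).$$

We first initialize all gradients to zero

$$\frac{\partial C}{\partial [f_\theta]_{i,t}} = 0 \;\; \forall i, t \quad \text{and} \quad \frac{\partial C}{\partial [A]_{i,j}} = 0 \;\; \forall i, j.$$

We then accumulate gradients over the second part of the cost $-s([x]_1^T, [y]_1^T, \tilde{\theta})$, which
gives:

$$\frac{\partial C}{\partial [f_\theta]_{[y]_t,t}} \mathrel{+}= -1 \quad \text{and} \quad \frac{\partial C}{\partial [A]_{[y]_{t-1},[y]_t}} \mathrel{+}= -1 \quad \forall t.$$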

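We now need to accumulate the gradients over the first part of the cost, that is $C_{logadd}$.
We differentiate $C_{logadd}$ by applying the chain rule through the recursion (14). First we
initialize our recursion with

$$\frac{\partial C_{logadd}}{\partial \delta_T(i)} = \frac{e^{\delta_T(i)}}{\sum_k e^{\delta_T(k)}} \quad \forall i.$$

We then compute iteratively:

$$\frac{\partial C_{logadd}}{\partial \delta_{t-1}(i)} = \sum_j \frac{\partial C_{logadd}}{\partial \delta_t(j)} \, \frac{e^{\delta_{t-1}(i) + [A]_{i,j}}}{\sum_k e^{\delta_{t-1}(k) + [A]_{k,j}}},$$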
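where at each step $t$ of the recursion we accumulate the gradients with respect to the
inputs $f_\theta$, and the transition scores $[A]_{i,j}$:

$$\frac{\partial C}{\partial [f_\theta]_{i,t}} \mathrel{+}= \frac{\partial C_{logadd}}{\partial \delta_t(i)} \, \frac{\partial \delta_t(i)}{\partial [f_\theta]_{i,t}} = \frac{\partial C_{logadd}}{\partial \delta_t(i)}$$

$$\frac{\partial C}{\partial [A]_{i,j}} \mathrel{+}= \frac{\partial C_{logadd}}{\partial \delta_t(j)} \, \frac{\partial \delta_t(j)}{\partial [A]_{i,j}} = \frac{\partial C_{logadd}}{\partial \delta_t(j)} \, \frac{e^{\delta_{t-1}(i) + [A]_{i,j}}}{\sum_k e^{\delta_{t-1}(k) + [A]_{k,j}}}.$$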
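The forward recursion and the backward accumulation described above can be sketched in pure Python. Several choices here are assumptions of this sketch rather than details from the paper's code: scores are indexed as `f[t][k]` (time first), the scores for starting in each tag are held in a separate vector `init`, and the function names are hypothetical.

```python
import math


def logadd(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sll_cost_and_grads(f, A, init, y):
    """Sentence-level cost C = C_logadd - s(x, y) and its gradients.

    f[t][k]: score of tag k at word t; A[i][j]: transition score i -> j;
    init[k]: start score of tag k (assumed separate); y: true tag sequence.
    """
    T, K = len(f), len(init)
    # Forward recursion: delta[t][k] is the logadd over paths ending in tag k.
    delta = [[init[k] + f[0][k] for k in range(K)]]
    for t in range(1, T):
        prev = delta[-1]
        delta.append([f[t][k] + logadd([prev[i] + A[i][k] for i in range(K)])
                      for k in range(K)])
    c_logadd = logadd(delta[-1])
    path = init[y[0]] + f[0][y[0]] + sum(
        A[y[t - 1]][y[t]] + f[t][y[t]] for t in range(1, T))

    grad_f = [[0.0] * K for _ in range(T)]
    grad_A = [[0.0] * K for _ in range(K)]
    grad_init = [0.0] * K
    # Backward recursion through C_logadd, starting from a softmax over delta_T.
    d_delta = [math.exp(v - c_logadd) for v in delta[-1]]
    for t in range(T - 1, 0, -1):
        d_prev = [0.0] * K
        for j in range(K):
            grad_f[t][j] += d_delta[j]
            norm = logadd([delta[t - 1][k] + A[k][j] for k in range(K)])
            for i in range(K):
                w = d_delta[j] * math.exp(delta[t - 1][i] + A[i][j] - norm)
                grad_A[i][j] += w
                d_prev[i] += w
        d_delta = d_prev
    for k in range(K):
        grad_f[0][k] += d_delta[k]
        grad_init[k] += d_delta[k]
    # Second part of the cost, -s(x, y): accumulate -1 along the true path.
    grad_init[y[0]] -= 1.0
    grad_f[0][y[0]] -= 1.0
    for t in range(1, T):
        grad_f[t][y[t]] -= 1.0
        grad_A[y[t - 1]][y[t]] -= 1.0
    return c_logadd - path, grad_f, grad_A, grad_init
```

Both passes cost on the order of T times K squared operations, whereas enumerating every tag sequence explicitly would grow exponentially with the sentence length.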



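Ranking Criterion We use the ranking criterion (18) for training our language model.
In this case, given a "positive" example $x$ and a "negative" example $x^{(w)}$, we want to
minimize:

$$C(f_\theta(x), f_\theta(x^{(w)})) = \max \left\{ 0, \; 1 - f_\theta(x) + f_\theta(x^{(w)}) \right\}. \qquad (26)$$

Ignoring the non-differentiability of $\max(0, \cdot)$ at zero, the gradient is simply given by:

$$\begin{pmatrix} \dfrac{\partial C}{\partial f_\theta(x)} \\[2ex] \dfrac{\partial C}{\partial f_\theta(x^{(w)})} \end{pmatrix} = \begin{cases} \begin{pmatrix} -1 \\ 1 \end{pmatrix} & \text{if } 1 - f_\theta(x) + f_\theta(x^{(w)}) > 0 \\[2ex] \begin{pmatrix} 0 \\ 0 \end{pmatrix} & \text{otherwise.} \end{cases}$$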
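The ranking criterion (26) and its gradient reduce to a few lines of code. The sketch below takes the two network scores as plain numbers; the function names are illustrative.

```python
def ranking_cost(score_pos, score_neg):
    """Hinge cost (26): max(0, 1 - f(x) + f(x_neg))."""
    return max(0.0, 1.0 - score_pos + score_neg)


def ranking_grad(score_pos, score_neg):
    """Return (dC/df(x), dC/df(x_neg)), ignoring the kink at zero."""
    if 1.0 - score_pos + score_neg > 0:
        return (-1.0, 1.0)
    return (0.0, 0.0)


# Usage: a violated margin gives a non-zero gradient, a satisfied one does not.
print(ranking_cost(0.25, 0.5), ranking_grad(0.25, 0.5))
print(ranking_cost(2.0, 0.5), ranking_grad(2.0, 0.5))
```

Once the positive example outscores the negative one by at least the margin of 1, the pair contributes no gradient.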